Previously, some extIndexDB metrics were not registered. It resulted
into missing metrics, if metric value was added to the extIndexDB. It's
a usual case for search requests at both indexes.
Current commit updates all metrics from extIndexDB according to the
current IndexDB. It must fix such cases
Related issue:
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/6868
### Describe Your Changes
Please provide a brief description of the changes you made. Be as
specific as possible to help others understand the purpose and impact of
your modifications.
### Checklist
The following checks are **mandatory**:
- [ ] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
### Describe Your Changes
The anchor to "Other fields" section should be #other-fields (instead of
#other-field)
### Checklist
The following checks are **mandatory**:
- [x] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
Signed-off-by: Cuong Le <cuongleqq@gmail.com>
`TL;DR` This PR improves the metric IDs search in IndexDB:
- Avoid seaching for metric IDs twice when `maxMetrics` limit is
exceeded
- Use correct error type for indicating that the `maxMetrics` limit is
exceded
- Simplify the logic of deciding between per-day and global index search
A unit test has been added to ensure that this refactoring does not
break anything.
---
Function calls before the fix:
```
idb.searchMetricIDs
|__ is.searchMetricIDs
|__ is.searchMetricIDsInternal
|__ is.updateMetricIDsForTagFilters
|__ is.tryUpdatingMetricIDsForDateRange
| |
|__ is.getMetricIDsForDateAndFilters
```
- `searchMetricIDsInternal` searches metric IDs for each filter set. It
maintains a metric ID set variable which is updated every time the
`updateMetricIDsForTagFilters` function is called. After each successful
call, the function checks the length of the updated metric ID set and if
it is greater than `maxMetrics`, the function returns `too many
timeseries` error.
- `updateMetricIDsForTagFilters` uses either per-day or global index to
search metric IDs for the given filter set. The decision of which index
to use is made is made within the `tryUpdatingMetricIDsForDateRange`
function and if it returns `fallback to global search` error then the
function uses global index by calling `getMetricIDsForDateAndFilters`
with zero date.
- `tryUpdatingMetricIDsForDateRange` first checks if the given time
range is larger than 40 days and if so returns `fallback to global
search` error. Otherwise it proceeds to searching for metric IDs within
that time range by calling `getMetricIDsForDateAndFilters` for each
date.
- `getMetricIDsForDateAndFilters` searches for metric IDs for the given
date and returns `fallback to global search` error if the number of
found metric IDs is greater than `maxMetrics`.
Problems with this solution:
1. The `fallback to global search` error returned by
`getMetricIDsForDateAndFilters` in case when maxMetrics is exceeded is
misleading.
2. If `tryUpdatingMetricIDsForDateRange` proceeds to date range search
and returns `fallback to global search` error (because
`getMetricIDsForDateAndFilters` returns it) then this will trigger
global search in `updateMetricIDsForTagFilters`. However the global
search uses the same maxMetrics value which means this search is
destined to fail too. I.e. the same search is performed twice and fails
twice.
3. `too many timeseries` error is already handled in
`searchMetricIDsInternal` and therefore handing this error in
`updateMetricIDsForTagFilters` is redundant
4. updateMetricIDsForTagFilters is a better place to make a decision on
whether to use per-day or global index.
Solution:
1. Use a dedicated error for `too many timeseries` case
2. Handle `too many timeseries` error in `searchMetricIDsInternal` only
3. Move the per-day or global search decision from
`tryUpdatingMetricIDsForDateRange` to `updateMetricIDsForTagFilters` and
remove `fallback to global search` error.
---------
Signed-off-by: Artem Fetishev <wwctrsrx@gmail.com>
Co-authored-by: Nikolay <nik@victoriametrics.com>
Once the timeseries is in tsidCache, new entries won't be created in
per-day index because the RegisterMetricNames() code does consider
different dates for the same timeseries. So this case has been added.
The same bug exists for AddRows() but it is not manifested because the
index entries are finally created in updatePerDateData().
RegisterMetricNames also updated to increase the newTimeseriesCreated
counter because it actually creates new time series in index.
A unit tests has been added that check all possible data patterns
(different metric names and dates) and code branches in both
RegisterMetricNames and AddRows. The total number of new unit tests is
around 100 which increaded the running time of storage tests by 50%.
---------
Signed-off-by: Artem Fetishev <wwctrsrx@gmail.com>
Co-authored-by: Roman Khavronenko <hagen1778@gmail.com>
### Describe Your Changes
Reduced the scope of rowsAddedTotal variable from global to Storage.
This metric clearly belongs to a given Storage object as it counts the
number of records added by a given Storage instance.
Reducing the scope improves the incapsulation and allows to reset this
variable during the unit tests (i.e. every time a new Storage object is
created by a test, that object gets a new variable).
Signed-off-by: Artem Fetishev <wwctrsrx@gmail.com>
### Describe Your Changes
This is an attempt to document IndexDB. I guess I was trying to touch
the important points that might be of interest for the end users while
refraining from making it too detailed (such as I did not enumerate and
describe all the specific record types).
Please take a look and any suggestions are very welcome.
### Checklist
The following checks are **mandatory**:
- [x ] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
---------
Signed-off-by: Artem Fetishev <wwctrsrx@gmail.com>
In the previous commit 8958cecad6
the default ports (80/443) were removed for both the `scrapeURL` and
`instance` label values for those targets without a port in
`__address__`. Different values in the `instance` label generate new
time series.
This commit reverts the changes made to the `instance` label. Now,
for those targets:
- `scrapeURL` will remain unchanged.
- The `instance` label value will include the default port.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/6792
### Describe Your Changes
release notes for patches 1.15.6 - 9
### Checklist
The following checks are **mandatory**:
- [ ] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
add command-line flag `-search.inmemoryBufSizeBytes` for configuring size of in-memory buffers used by vmselect during processing of vmstorage responses. A new summary metric `vm_tmp_blocks_inmemory_file_size_bytes` is exposed to show the size of the buffer during requests processing.
The new setting can be used by experienced users to adjust memory usage by vmselect when processing
many small read requests. Instead of allocating 4MB buffers each time, vmselect can be instructed to lower
the buffer size via `-search.inmemoryBufSizeBytes`. To make the decision whether this flag needs to be adjusted
users can consult with `vm_tmp_blocks_inmemory_file_size_bytes` which shows the actual size of buffers used
during query processing.
----------
The detailed information of this PR can be found in
https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6851
### Checklist
The following checks are **mandatory**:
- [ ] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
---------
Co-authored-by: hagen1778 <roman@victoriametrics.com>
(cherry picked from commit cab3ef8294)
## Describe Your Changes
Add RemoteWrite Retry Controls
This PR introduces two new flags to the remote write functionality:
- remoteWrite.retryMinInterval
- remoteWrite.retryMaxTime
These flags provide finer control over the retry behavior for
remoteWrite operations, allowing users to customize the minimum interval
between retries and the maximum duration for retry attempts.
Fixes#5486.
## Checklist
- [x] The following checks are mandatory:
My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
---------
Signed-off-by: Yury Akudovich <ya@matterlabs.dev>
Co-authored-by: hagen1778 <roman@victoriametrics.com>
### Describe Your Changes
Adds a link to the API example section that describes how to delete
metrics on VM Single.
### Checklist
The following checks are **mandatory**:
- [x] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: hagen1778 <roman@victoriametrics.com>
This change is made in attempt to reduce memory usage by vmalert when
parsing big instant responses from VM/Prometheus.
In
a5c427bac4
vmalert switched from std json lib to fastjson lib in order to reduce
amount of allocations, as according to highloaded profiles of vmalert
the CPU is mostly spent on GC.
But switching to fastjson resulted into excessive memory usage for cases
when vmalert has to parse long json lines, which usually happens when
instant response contains many `metric` objects.
In this change we do a mixed parsing:
1. Slice of `metric` objects is parsed with std lib to keep mem low
2. Each `metric` object is parsed with fastjson to reduce allocs
The benchmark results are the following:
```
pkg: github.com/VictoriaMetrics/VictoriaMetrics/app/vmalert/datasource
BenchmarkParsePrometheusResponse/Instant_std+fastjson-10 1760 668959 ns/op 280147 B/op 5781 allocs/op
MBs allocated at heap: 493.078392
mallocs: 18655472
BenchmarkParsePrometheusResponse/Instant_fastjson-10 6109 198258 ns/op 172839 B/op 5548 allocs/op
MBs allocated at heap: 1056.384464
mallocs: 34457184
BenchmarkParsePrometheusResponse/Instant_std-10 1287 950987 ns/op 451677 B/op 9619 allocs/op
MBs allocated at heap: 580.802976
mallocs: 13351636
```
The benchmark function code with mem measurement is available here
https://gist.github.com/hagen1778/b9c3ca7f8ca7d6b21aec9777112c5810
The benchmark contains 3 results:
1. Instant_std+fastjson is the implementation in this change
2. Instant_fastjson-10 is the implementation from
a5c427bac4
3. BenchmarkParsePrometheusResponse/Instant_std-10 is implementation
before
a5c427bac4
According to these results, this new implementation is slower than
previous, but faster than before switching to fastjson. It also has
lower number of allocations and roughly the same memory allocation on
heap with GC turned off.
---------
Other changes:
1. rm BenchmarkMetrics as it doesn't measure anything
2. simplify BenchmarkParsePrometheusResponse into
BenchmarkPromInstantUnmarshal
### Describe Your Changes
Please provide a brief description of the changes you made. Be as
specific as possible to help others understand the purpose and impact of
your modifications.
### Checklist
The following checks are **mandatory**:
- [ ] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Moving key-concepts-related docs to a separate dir should make it easier
to navigate in `docs/` folder and helps to avoid adding prefixes to
image assets.
-----------
The change shouldn't have any visual changes or changes to the links.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Moving changelog-related docs to a separate dir should make it easier to
navigate in `docs/` folder.
-----------
The change shouldn't have any visual changes or changes to the links.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
- fix TS lint
- anomaly: remove /vmui
- anomaly: minor inspections fix
- docs: fix broken links to headings
### Describe Your Changes
Initially vmanomaly opened with `/vmui` in serverUrl, remove it.
* Adds custom dial func for HTTP-Connect and socks5 proxy tunnels.
Standard golang http.transport exposes GetProxyConnectHeader function,
but it doesn't allow to use separate tls config for proxy.
It also not possible to enforce HTTP-Connect with standard http lib.
* For http scrape targets, by default http.Transport.Proxy function must
be used. Since it has special case with full uri forward.
* Adds proxy.URL json methods that allow to properly copy internal
fields, like User/Password.
It should fix bug with proxy_url. When credentials specified at URL was
ignored.
* Adds tests for scrape client proxy requests
related issue https://github.com/VictoriaMetrics/VictoriaMetrics/issues/6771
* It was necessary to add default ports for fasthttp client. After migration to the std.httpclient it's no longer needed.
* An additional configuration is required at proxy servers with implicitly set 80/443 ports to the host header (such as HA proxy.
It's expected that after upgrade __address_ label may change. But it should be rare case. 80/443 ports are not widely used at monitoring ecosystem. And it shouldn't have much impact.
Related issue https://github.com/VictoriaMetrics/VictoriaMetrics/issues/6792
Co-authored-by: Nikolay <nik@victoriametrics.com>
to allow configuring additional headers in each request to the
corresponding notifier.
Other flags like `-datasource.headers`, `-remoteWrite.headers` already
use `^^` as delimiter, it's consistent to use it in `-notifier.headers`
as well.
related https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3260
vmalert can integrate with alertmanager that supports multi-tenant by
adding tenantID header`X-Scope-OrgID` in requests.
In multitenancy, vmalert can also filter alerts which send to different
notifier addresses(or with different header settings) using
`alert_relabel_configs`.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3260
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: hagen1778 <roman@victoriametrics.com>
### Describe Your Changes
docs: vmanomaly - v1.15.5 patch notes
### Checklist
The following checks are **mandatory**:
- [x] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
### Describe Your Changes
Updated model list in Anomaly Detection Overview
### Checklist
The following checks are **mandatory**:
- [x] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
### Describe Your Changes
release notes for 1.15.4 patch
### Checklist
The following checks are **mandatory**:
- [x] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
### Describe Your Changes
- change links from relative to absolute under Anomaly Detection section
### Checklist
The following checks are **mandatory**:
- [x] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
### Describe Your Changes
changelog updates to v1.15.3 patch of `vmanomaly`
### Checklist
The following checks are **mandatory**:
- [ ] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
### Describe Your Changes
small update to `data_range` parameter in uppermost config conversion
example
### Checklist
The following checks are **mandatory**:
- [ ] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
### Describe Your Changes
update vmanomaly docs to forthcoming release v1.15.2
### Checklist
The following checks are **mandatory**:
- [x] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
Production workload shows that it's useful optimisation.
Channel based objects pool allows to handle irregural data ingestion
requests and make memory allocations more smooth.
It's improves sync.Pool efficiency, since objects from sync.Pool removed
after 2 GC cycles. With GOGC=30 value, GC runs significantly more often.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/6733
### Checklist
The following checks are **mandatory**:
- [ ] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
---------
Signed-off-by: f41gh7 <nik@victoriametrics.com>
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: hagen1778 <roman@victoriametrics.com>
(cherry picked from commit f255800da3)
Signed-off-by: hagen1778 <roman@victoriametrics.com>
# Conflicts:
# app/vminsert/common/insert_ctx_pool.go
Describe Your Changes
When I use usePromCompatibleNaming with vmagent to process data that
needs to be formatted from different sources such as InfluxDB, I find
that it doesn’t work
However, it works in vminsert. I found that vminsert uses the
HasRelabeling method to determine whether to relabel.
```go
func HasRelabeling() bool {
pcs := pcsGlobal.Load()
return pcs.Len() > 0 || *usePromCompatibleNaming
}
```
in vmagent, the decision to relabel is determined only by
pcsGlobal.Len() > 0. However, in the applyRelabeling method, the
usePromCompatibleNaming logic is also used to determine whether to
relabel in the error handling.
```go
func (rctx *relabelCtx) applyRelabeling(tss []prompbmarshal.TimeSeries, pcs *promrelabel.ParsedConfigs) []prompbmarshal.TimeSeries {
if pcs.Len() == 0 && !*usePromCompatibleNaming {
// Nothing to change.
return tss
}
```
So I think that the logic for determining whether to relabel in vmagent
is not as expected.
Checklist
The following checks are mandatory:
[✅]My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
---------
Co-authored-by: Roman Khavronenko <hagen1778@gmail.com>
…eep_metric_names` options in stream aggregation config together
With aggregated data and raw data under the same metric, results would
be confusing.
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: hagen1778 <roman@victoriametrics.com>
### Describe Your Changes
typos fix & clarity improvement of vmanomaly docs after v1.15.1 release
### Checklist
The following checks are **mandatory**:
- [ ] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
### Describe Your Changes
Updated user management guide with new cloud content
This PR should be merged after the cloud PR
### Checklist
The following checks are **mandatory**:
- [x] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
Before, buffer growth was always x2 of its size, which could lead to
excessive memory usage when processing big amount of data.
For example, scraping a target with hundreds of MBs in response could
result into hih memory spikes in vmagent because buffer has to double
its size to fit the response. See
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/6759
The change smoothes out the growth rate, trading higher allocation rate
for lower mem usage at certain conditions.
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
### Describe Your Changes
* `sort` param is unused by the current website engine, and was present only for compatibility
with previous website engine. It is time to remove it as it makes no effect
* re-structure guides content into folders to simplify assets management
### Checklist
The following checks are **mandatory**:
- [ ] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
### Describe Your Changes
Fixing remaining typos and missing words after v.1.15.0 updates
### Checklist
The following checks are **mandatory**:
- [ ] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
### Describe Your Changes
- updated docs on `vmanomaly` with v1.15.0
- additional chapters of FAQ and model pages
### Checklist
The following checks are **mandatory**:
- [x] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
VM has different responses to equivalent queries for MetricsQL and
GraphiteQL in case of failed access to one of vmstorage node of the
cluster vmstorage nodes. For GraphiteQL, the denyPartialResponse feature
is not used, it is always true, which is not always correct (depending
on the configuration).
In the PR I have removed the hardcoded denyPartialResponse for
GraphiteQL, just like MetricsQL does.
- [x] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
(cherry picked from commit 79008b712f)
Signed-off-by: hagen1778 <roman@victoriametrics.com>
The resetState arg was used only for the BenchmarkAggregatorsFlushInternalSerial benchmark.
This benchmark was testing aggregate state flush performance by keeping the same state across flushes.
The benhmark didn't reflect the performance and scalability of stream aggregation in production,
while it led to non-trivial code changes related to resetState arg handling.
So let's drop the benchmark together with all the code related to resetState handling,
in order to simplify the code at lib/streamaggr a bit.
Thanks to @AndrewChubatiuk for the original idea at https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6314
Prevsiously every aggregation output was using its own timestamp for the output aggregated samples
in a single aggregation interval. This could result in unexpected inconsitent timesetamps for the output
aggregated samples.
This commit consistently uses the same timestamp across all the output aggregated samples.
This commit makes sure that the duration between subsequent timestamps strictly equals
the configured aggregation interval.
Thanks to @AndrewChubatiuk for the original idea at https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6314
This commit should help https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4580
Make `-remoteWrite.streamAggr.ignoreFirstIntervals` of array type so it could
accept multiple values which can be applied to the corresponding`-remoteWrite.url`.
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: hagen1778 <roman@victoriametrics.com>
Fix `-streamAggr.dropInputLabels` behavior when global deduplication is enabled without `-streamAggr.config`.
Previously, `-remoteWrite.streamAggr.dropInputLabels` is misapplied.
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: hagen1778 <roman@victoriametrics.com>
By introducing this feature, users will have the ability to customize
the sampleLimit parameter on a per-target basis, providing more
flexibility and control over the job execution behavior.
The error check was needed before a84491324d
It was kept by mistake and makes no sense to have rn.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
### Describe Your Changes
fixed yaml header in a guide doc, that causes hugo build error
### Checklist
The following checks are **mandatory**:
- [ ] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
### Describe Your Changes
- Adds support for displaying the top 5 log streams in the hits graph,
grouping the remaining streams into an "other" label.
#6545
- Adds options to customize the graph display with bar, line, stepped
line, and points views.
### Checklist
The following checks are **mandatory**:
- [x] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
The changes are based on SEO report and supposed to improve
ranking and indexation by search engines by using prompt and unique titles
and by updating unreachable links.
It also updates links to have a simplified form and replaces relative links with absolute links
according to https://docs.victoriametrics.com/#documentation
---------
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
Adds Prometheus Grafana Alloy and vmagent to the data ingestion
protocols. Grafana Agent was not added since it has been deprecated in
favor of alloy
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
### Describe Your Changes
https://victoriametrics.slack.com/archives/C05UNTPAEDN/p1722833182319299
Sometimes users may use the wrong (lower) version of Grafana when
setting up the VictoriaLogs datasource.
It would be good to document the requirements of Grafana versions to use
VictoriaMetrics/VictoriaLogs datasource.
### Checklist
The following checks are **mandatory**:
- [X] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
### Describe Your Changes
Fix a typo in docs.
### Checklist
The following checks are **mandatory**:
- [x] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
### Describe Your Changes
Added `--vm-backoff-retries`, `--vm-backoff-factor`,
`--vm-backoff-min-duration` and `--vm-native-backoff-retries`,
`--vm-native-backoff-factor`, `--vm-native-backoff-min-duration`
command-line flags to the `vmctl` app. Those changes will help to
configure the retry backoff policy for different situations.
Related issue:
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/6622
### Checklist
The following checks are **mandatory**:
- [X] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: hagen1778 <roman@victoriametrics.com>
* add a toggle button to the "Group" tab that allows users to expand or collapse all groups at once
* introduce the ability to select a key for grouping logs within the "Group" tab
* display the number of entries within each log group.
* move the Markdown toggle to the general settings panel in the upper left corner.
Changes to .ts files in vmui or vmui for logs require re-building
static files that will be included into compiled binary afterwards.
We don't update static files on each .ts change PR because it results
in too many changes and complicates review.
So we need to update these static files before the actual release.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Fixes panic if incorrect metricsql expression passed to the prettifier API.
Prettify function had misleading panic for duration expression formatting. It expected all WITH templates to be already parsed.
But WITH expression expand was removed.
Bug was introduced at e712a49898
and present at v1.98.0+ releases
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/6736
Signed-off-by: f41gh7 <nik@victoriametrics.com>
* add Logo guidelines to GitHub Readme, as it may have higher chances to be viewed by users
* rm Logo guidelines from cluster version docs, as it makes no sense anymore for this page
* rm `picutre` tag from cluster version docs, as it is not used by GitHub anymore
Signed-off-by: hagen1778 <roman@victoriametrics.com>
### Describe Your Changes
- replace docs in root README with a link to official documentation
- remove old make commands for documentation
- remove redundant "VictoriaMetrics" from document titles
- merge changelog docs into a section
- rm content of Single-server-VictoriaMetrics.md as it can be included from docs/README
- add basic information to README in the root folder, so it will be useful for github users
- rm `picture` tag from docs/README as it was needed for github only, we don't display VM logo at docs.victoriametrics.com
- update `## documentation` section in docs/README to reflect the changes
- rename DD pictures, as they now belong to docs/README
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: hagen1778 <roman@victoriametrics.com>
The new panel will show the 99th quantile of scrape duration in seconds.
This should help identifying vmagent instances that experiences too high scraping durations.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
### Describe Your Changes
These changes were made as part of the title transition from Managed
VictoriaMetrics to VictoriaMetrics Cloud
### Checklist
The following checks are **mandatory**:
- [x] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
### Describe Your Changes
After breadcrumb was added to docs.victoriametrics.com there's no need
to specify parent page name in a title
<img width="1437" alt="Screenshot 2024-07-27 at 10 20 09"
src="https://github.com/user-attachments/assets/733f41f4-a727-4f52-a7c0-6019edf1b803">
Also added vmdocs to gitignore to avoid committing it
### Checklist
The following checks are **mandatory**:
- [ ] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
This reverts commit e280d90e9a.
Reason for revert: the updated code doesn't improve the performance of table.MustAddRows for the typical case
when rows contain timestamps belonging to ptws[0].
The performance may be improved in theory for the case when all the rows belong to partiton other than ptws[0],
but this partition is automatically moved to ptws[0] by the code at lines
6aad1d43e9/lib/storage/table.go (L287-L298) ,
so the next time the typical case will work.
Also the updated code makes the code harder to follow, since it introduces an additional level of indirection
with non-trivial semantics inside table.MustAddRows - the partition.TimeRangeInPartition() function.
This function needs to be inspected and understood when reading the code at table.MustAddRows().
This function depends on minTsInRows and maxTsInRows vars, which are defined and initialized
many lines above the partition.TimeRangeInPartition() call. This complicates reading and understanding
the code even more.
The previous code was using clearer loop over rows with the clear call to partition.HasTimestamp()
for every timestamp in the row. The partition.HasTimestamp() call is used in the table.MustAddRows()
function multiple times. This makes the use of partition.HasTimestamp() call more consistent,
easier to understand and easier to maintain comparing to the mix of partition.HasTimestamp() and partition.TimeRangeInPartition()
calls.
Aslo, there is no need in documenting some hardcore software engineering refactoring at docs/CHANGLELOG.md,
since the docs/CHANGELOG.md is intended for VictoriaMetrics users, who may not know software engineering.
The docs/CHANGELOG.md must document user-visible changes, and the docs must be concise and clear for VictoriaMetrics users.
See https://docs.victoriametrics.com/contributing/#pull-request-checklist for more details.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6629
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6677
Relative links in docs are much harder to maintain in consistent state comparing to absolute links:
- It is non-trivial to figure out the proper relative link path when creating and editing docs.
- Relative links break after moving the doc files to another paths, and it is non-trivial
to figure which links are broken after that.
See also f357ee57ef , d5809f8e12 and 8cb1822b94
- Document `make docs-debug` command at https://docs.victoriametrics.com/#documentation
- Remove unneeded ROOTDIR, REPODIR and WORKDIR env vars from docs/Makefile ,
since it is documented and expected that all the Makefile commands are run from the repository root.
- Use `docker --rm` for running Docker container with local docs server, so it is automatically
removed after pressing `Ctrl+C`. This makes the container cleanup automatic.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6677
### Describe Your Changes
The original logic is not only highly complex but also poorly readable,
so it can be modified to increase readability and reduce time
complexity.
---------
Co-authored-by: Zhu Jiekun <jiekun@victoriametrics.com>
### Describe Your Changes
Please provide a brief description of the changes you made. Be as
specific as possible to help others understand the purpose and impact of
your modifications.
### Checklist
The following checks are **mandatory**:
- [ ] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
### Describe Your Changes
Replaced global http links in docs with relative markdown ones
### Checklist
The following checks are **mandatory**:
- [ ] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
### Describe Your Changes
Use relative markdown references, removed `{{< ref >}}` shortcodes
### Checklist
The following checks are **mandatory**:
- [ ] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
### Describe Your Changes
- moved files from root to VictoriaMetrics folder to be able to mount
operator docs and VictoriaMetrics docs independently
- added ability to run website locally
### Checklist
The following checks are **mandatory**:
- [ ] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
### Describe Your Changes
Please provide a brief description of the changes you made. Be as
specific as possible to help others understand the purpose and impact of
your modifications.
### Checklist
The following checks are **mandatory**:
- [ ] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
### Describe Your Changes
Change response code to 502 to align it with behaviour of other existing
reverse proxies. Currently, the following reverse proxies will return
502 in case an upstream is not available: nginx, traefik, caddy, apache.
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
### Describe Your Changes
Improve documentation to help:
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/6670
Currently, the documentation of Kafka Integration did not describe:
- How to switch between different remote write protocols in producer
side and consumer side.
### Describe Your Changes
Add a note to clarify usage of azure credentials when multiple
credentials are available.
The user is required to specify AZURE_CLIENT_ID as otherwise Azure API
will return an error: "Multiple user assigned identities exist, please
specify the clientId / resourceId of the identity in the token request"
### Checklist
The following checks are **mandatory**:
- [x] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
The /some_path/.+ regexp matches /some_path/ followed by at least a single char.
This is unexpected by most users, since they expect it should match /some_path/.
Substitute .+ with .*, so this regexp matches /some_path/ .
There is no need in enumerating all the files, which must be updated, since these files
may be moved to other locations. It is enough to mention that versions for all the VictoriaMetrics
components must be updated.
### Describe Your Changes
Add patch note v1.13.3 to CHANGELOG doc page for `vmanomaly`
### Checklist
The following checks are **mandatory**:
- [x] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
- Mention that credentials can be configured via env variables at both vmbackup and vmrestore docs.
- Make clear that the AZURE_STORAGE_DOMAIN env var is optional at https://docs.victoriametrics.com/vmbackup/#providing-credentials-via-env-variables
- Use string literals as is for env variable names instead of indirecting them via string constants.
This makes easier to read and understand the code. These environment variable names aren't going to change
in the future, so there is no sense in hiding them under string constants with some other names.
- Refer to https://docs.victoriametrics.com/vmbackup/#providing-credentials-via-env-variables in error messages
when auth creds are improperly configured. This should simplify figuring out how to fix the error.
- Simplify the code a bit at FS.newClient(), so it is easier to follow it now.
While at it, remove the check when superflouos environment variables are set, since it is too fragile
and it looks like it doesn't help properly configuring vmbackup / vmrestore.
- Remove envLookuper indirection - just use 'func(name string) (string, bool)' type inline.
This simplifies code reading and understanding.
- Split TestFSInit() into TestFSInit_Failure() and TestFSInit_Success(). This simplifies the test code,
so it should be easier to maintain in the future.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6518
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5984
The %q formatter may result in incorrectly formatted JSON string if the original string
contains special chars such as \x1b . They must be encoded as \u001b , otherwise the resulting JSON string
cannot be parsed by JSON parsers.
This is a follow-up for c0caa69939
See https://github.com/VictoriaMetrics/victorialogs-datasource/issues/24
- Clarify the description of -graphite.sanitizeMetricName command-line flag at README.md
- Do not sanitize tag values - only metric names and tag names must be sanitized,
since they are treated specially by Grafana. Grafana doesn't apply any restrictions on tag values.
- Properly replace more than two consecutive dots with a single dot.
- Disallow unicode letters in metric names and tag names, since neither Prometheus nor Grafana
do not support them.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6489
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/6077
### Describe Your Changes
`Storage.AddRows()` returns an error only in one case: when
`Storage.updatePerDateData()` fails to unmarshal a `metricNameRaw`. But
the same error is treated as a warning when it happens inside
`Storage.add()` or returned by `Storage.prefillNextIndexDB()`.
This commit fixes this inconsistency by treating the error returned by
`Storage.updatePerDateData()` as a warning as well. As a result
`Storage.add()` does not need a return value anymore and so doesn't
`Storage.AddRows()`.
Additionally, this commit adds a unit test that checks all cases that
result in a row not being added to the storage.
---------
Signed-off-by: Artem Fetishev <wwctrsrx@gmail.com>
Co-authored-by: Nikolay <nik@victoriametrics.com>
The purpose of docs/CHANGELOG.md is to provide VictoriaMetrics users clear and concise information
on what's changed at VictoriaMetrics components. Technical details of the change are unclear
to most of VictoriaMetrics users, who are not familiar with VictoriaMetrics source code.
These details complicate reading the docs/CHANGELOG.md by ordinary users, so do not clutter
the changelog with technical details. If the user wants technical details, he can click
the link to the related GitHub issue and/or pull request and dive into all the details he wants.
- Rename overrideHostHeader() function to hasEmptyHostHeader()
- Rename overrideHostHeader field at UserInfo to useBackendHostHeader
This should simplify the future maintenance of the code
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6525
This reverts commit 4d66e042e3.
Reasons for revert:
- The commit makes unrelated invalid changes to docs/CHANGELOG.md
- The changes at app/vmauth/main.go are too complex. It is better splitting them into two parts:
- pooling readTrackingBody struct for reducing pressure on GC
- avoiding to use readTrackingBody when -maxRequestBodySizeToRetry command-line flag is set to 0
Let's make this in the follow-up commits!
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/6445
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6533
- Obtain IAM token via GCE-like API instead of Amazon EC2 IMDSv2 API,
since it looks like IMDBSv2 API isn't supported by Yandex Cloud
according to https://yandex.cloud/en/docs/security/standard/authentication#aws-token :
> So far, Yandex Cloud does not support version 2, so it is strongly recommended
> to technically disable getting a service account token via the Amazon EC2 metadata service.
- Try obtaining IAM token via GCE-like API at first and then fall back to the deprecated Amazon EC2 IMDBSv1.
This should prevent from auth errors for instances with disabled GCE-like auth API.
This addresses @ITD27M01 concern at https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5513#issuecomment-1867794884
- Make more clear the description of the change at docs/CHANGELOG.md , add reference to the related issue.
P.S. This change wasn't tested in prod because I have no access to Yandex Cloud.
It is recommended to test this change by @ITD27M01 and @vmazgo , who filed
the issue https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5513
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6524
This reverts commit 6b128da811.
Reason for revert: this complicates and slows down CI/CD without giving significant benefits in return.
The idea of automatic building, publishing and deploying Docker images to our playground on every pull request
and commit isn't very bright because of the following reasons:
- It slows down CI/CD pipeline
- It increases costs on CPU time spent at CI/CD pipeline
- It contradicts goal #7 at https://docs.victoriametrics.com/goals/#goals and non-goal #8 at https://docs.victoriametrics.com/goals/#non-goals
The previous workflow was much better - if we need to deploy some new Docker image at playground or staging environment,
then just __manually__ build and deploy the needed Docker image there. If the manual process requires making too many
steps, then think on how to automate these steps into a single Makefile command.
Updates https://github.com/VictoriaMetrics/ops/pull/1297
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6515
- Move the test for SRV discovery into a separate function. This allows verifying round-robin discovery across SRV records.
- Restore the original netutil.Resolver after the test finishes, so it doesn't interfere with other tests.
- Move the description of the bugfix into the correct place at docs/CHANGELOG.md - it should be placed under v1.102.0-rc2
instead of v1.102.0-rc1.
- Remove unneeded code in URLPrefix.sanitizeAndInitialize(), since it is expected this function is called only once
for finishing URLPrefix initializiation. In this case URLPrefix.nextDiscoveryDeadline and URLPrefix.n are equal to 0
according to https://pkg.go.dev/sync/atomic#Uint64
- Properly fix the bug at URLPrefix.discoverBackendAddrsIfNeeded() - it is expected that hostToAddrs map uses
the original hostname keys, including 'srv+' prefix, so it shouldn't be removed when looping over up.busOriginal.
Instead, the 'srv+' prefix must be removed from the hostname only locally before passing the hostname to netutil.Resolver.LookupSRV.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6401
- Rename GetStatDialFunc to NewStatDialFunc, since it returns new function with every call
- NewStatDialFunc isn't related to http in any way, so it must be moved from lib/httputils to lib/netutil
- Simplify the implementation of NewStatDialFunc by removing sync.Map from there.
- Use netutil.NewStatDialFunc at app/vmauth and lib/promscrape/discoveryutils
- Use gauge instead of counter type for *_conns metric
This is a follow-up for d7b5062917
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6299
- Move the remaining code responsible for stream aggregation initialization from remotewrite.go to streamaggr.go .
This improves code maintainability a bit.
- Properly shut down streamaggr.Aggregators initialized inside remotewrite.CheckStreamAggrConfigs().
This prevents from potential resource leaks.
- Use separate functions for initializing and reloading of global stream aggregation and per-remoteWrite.url stream aggregation.
This makes the code easier to read and maintain. This also fixes INFO and ERROR logs emitted by these functions.
- Add an ability to specify `name` option in every stream aggregation config. This option is used as `name` label
in metrics exposed by stream aggregation at /metrics page. This simplifies investigation of the exposed metrics.
- Add `path` label additionally to `name`, `url` and `position` labels at metrics exposed by streaming aggregation.
This label should simplify investigation of the exposed metrics.
- Remove `match` and `group` labels from metrics exposed by streaming aggregation, since they have little practical applicability:
it is hard to use these labels in query filters and aggregation functions.
- Rename the metric `vm_streamaggr_flushed_samples_total` to less misleading `vm_streamaggr_output_samples_total` .
This metric shows the number of samples generated by the corresponding streaming aggregation rule.
This metric has been added in the commit 861852f262 .
See https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6462
- Remove the metric `vm_streamaggr_stale_samples_total`, since it is unclear how it can be used in practice.
This metric has been added in the commit 861852f262 .
See https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6462
- Remove Alias and aggrID fields from streamaggr.Options struct, since these fields aren't related to optional params,
which could modify the behaviour of the constructed streaming aggregator.
Convert the Alias field to regular argument passed to LoadFromFile() function, since this argument is mandatory.
- Pass Options arg to LoadFromFile() function by reference, since this structure is quite big.
This also allows passing nil instead of Options when default options are enough.
- Add `name`, `path`, `url` and `position` labels to `vm_streamaggr_dedup_state_size_bytes` and `vm_streamaggr_dedup_state_items_count` metrics,
so they have consistent set of labels comparing to the rest of streaming aggregation metrics.
- Convert aggregator.aggrStates field type from `map[string]aggrState` to `[]aggrOutput`, where `aggrOutput` contains the corresponding
`aggrState` plus all the related metrics (currently only `vm_streamaggr_output_samples_total` metric is exposed with the corresponding
`output` label per each configured output function). This simplifies and speeds up the code responsible for updating per-output
metrics. This is a follow-up for the commit 2eb1bc4f81 .
See https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6604
- Added missing urls to docs ( https://docs.victoriametrics.com/stream-aggregation/ ) in error messages. These urls help users
figuring out why VictoriaMetrics or vmagent generates the corresponding error messages. The urls were removed for unknown reason
in the commit 2eb1bc4f81 .
- Fix incorrect update for `vm_streamaggr_output_samples_total` metric in flushCtx.appendSeriesWithExtraLabel() function.
While at it, reduce memory usage by limiting the maximum number of samples per flush to 10K.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5467
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6268
### Describe Your Changes
initial docs to implement #6618 more platforms can be added on this
branch or on future commits.
### Checklist
The following checks are **mandatory**:
- [x ] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
### Describe Your Changes
Doc updates after v1.13.2 release of `vmanomaly`
### Checklist
The following checks are **mandatory**:
- [ ] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
…ion about Grafana Datasource in quering
### Describe Your Changes
Update Roadmap and Querying documentation for VictoriaLogs
### Checklist
The following checks are **mandatory**:
- [x] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
---------
Signed-off-by: Artem Navoiev <tenmozes@gmail.com>
Pending rows and items unconditionally remain in memory for up to pending{Items,Rows}FlushInterval,
so there is no any sense in setting dataFlushInterval (the interval for guaranteed flush of in-memory data to disk)
to values smaller than pending{Items,Rows}FlushInterval, since this doesn't affect the interval
for flushing pending rows and items from memory to disk.
This is a follow-up for 4c80b17027
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6221
- Consistently enumerate stream aggregation outputs in alphabetical order across the source code and docs.
This should simplify future maintenance of the corresponding code and docs.
- Fix the link to `rate_sum()` at `see also` section of `rate_avg()` docs.
- Make more clear the docs for `rate_sum()` and `rate_avg()` outputs.
- Encapsulate output metric suffix inside rateAggrState. This eliminates possible bugs related
to incorrect suffix passing to newRateAggrState().
- Rename rateAggrState.total field to less misleading rateAggrState.increase name, since it calculates
counter increase in the current aggregation window.
- Set rateLastValueState.prevTimestamp on the first sample in time series instead of the second sample.
This makes more clear the code logic.
- Move the code for removing outdated entries at rateAggrState into removeOldEntries() function.
This make the code logic inside rateAggrState.flushState() more clear.
- Do not write output sample with zero value if there are no input series, which could be used
for calculating the rate, e.g. if only a single sample is registered for every input series.
- Do not take into account input series with a single registered sample when calculating rate_avg(),
since this leads to incorrect results.
- Move {rate,total}AggrState.flushState() function to the end of rate.go and total.go files, so they look more similar.
This shuld simplify future mantenance.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6243
The old link was changed globally to the new link in the commit f4b1cbfef0 .
Unfortunately, old links are still posted in new commits :(
This is a follow-up for 680b8c25c8 .
While at it, remove duplicate 'len(*remoteWriteURLs) > 0' check in the remotewrite.Init() functions,
since this check is already made at the beginning of the function.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6253
- Drop samples and return true from remotewrite.TryPush() at fast path when all the remote storage
systems are configured with the disabled on-disk queue, every in-memory queue is full
and -remoteWrite.dropSamplesOnOverload is set to true. This case is quite common,
so it should be optimized. Previously additional CPU time was spent on per-remoteWriteCtx
relabeling and other processing in this case.
- Properly count the number of dropped samples inside remoteWriteCtx.pushInternalTrackDropped().
Previously dropped samples were counted only if -remoteWrite.dropSamplesOnOverload flag is set.
In reality, the samples are dropped when they couldn't be sent to the queue because in-memory queue is full
and on-disk queue is disabled.
The remoteWriteCtx.pushInternalTrackDropped() function is called by streaming aggregation for pushing
the aggregated data to the remote storage. Streaming aggregation cannot wait until the remote storage
processes pending data, so it drops aggregated samples in this case.
- Clarify the description for -remoteWrite.disableOnDiskQueue command-line flag at -help output,
so it is clear that this flag can be set individually per each -remoteWrite.url.
- Make the -remoteWrite.dropSamplesOnOverload flag global. If some of the remote storage systems
are configured with the disabled on-disk queue, then there is no sense in keeping samples
on some of these systems, while dropping samples on the remaining systems, since this
will result in global stall on the remote storage system with the disabled on-disk queue
and with the -remoteWrite.dropSamplesOnOverload=false flag. vmagent will always return false
from remotewrite.TryPush() in this case. This will result in infinite duplicate samples
written to the remaining remote storage systems. That's why the -remoteWrite.dropSamplesOnOverload
is forcibly set to true if more than one -remoteWrite.disableOnDiskQueue flag is set.
This allows proceeding with newly scraped / pushed samples by sending them to the remaining
remote storage systems, while dropping them on overloaded systems with the -remoteWrite.disableOnDiskQueue flag set.
- Verify that the remoteWriteCtx.TryPush() returns true in the TestRemoteWriteContext_TryPush_ImmutableTimeseries test.
- Mention in vmagent docs that the -remoteWrite.disableOnDiskQueue command-line flag can be set individually per each -remoteWrite.url.
See https://docs.victoriametrics.com/vmagent/#disabling-on-disk-persistence
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6248
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6065
This makes test code more clear and reduces the number of code lines by 500.
This also simplifies debugging tests. See https://itnext.io/f-tests-as-a-replacement-for-table-driven-tests-in-go-8814a8b19e9e
While at it, consistently use t.Fatal* instead of t.Error* across tests, since t.Error*
requires more boilerplate code, which can result in additional bugs inside tests.
While t.Error* allows writing logging errors for the same, this doesn't simplify fixing
broken tests most of the time.
This is a follow-up for a9525da8a4
This simplifies debugging tests and makes the test code more clear and concise.
See https://itnext.io/f-tests-as-a-replacement-for-table-driven-tests-in-go-8814a8b19e9e
While at is, consistently use t.Fatal* instead of t.Error* across tests, since t.Error*
requires more boilerplate code, which can result in additional bugs inside tests.
While t.Error* allows writing logging errors for the same, this doesn't simplify fixing
broken tests most of the time.
This is a follow-up for a9525da8a4
### Describe Your Changes
Please provide a brief description of the changes you made. Be as
specific as possible to help others understand the purpose and impact of
your modifications.
### Checklist
The following checks are **mandatory**:
- [x] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
### Describe Your Changes
Please provide a brief description of the changes you made. Be as
specific as possible to help others understand the purpose and impact of
your modifications.
### Checklist
The following checks are **mandatory**:
- [x] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
### Describe Your Changes
Please provide a brief description of the changes you made. Be as
specific as possible to help others understand the purpose and impact of
your modifications.
### Checklist
The following checks are **mandatory**:
- [x] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
Consistently using t.Fatal* simplifies the test code and makes it less fragile, since it is common error
to forget to make proper cleanup after t.Error* call. Also t.Error* calls do not provide any practical
benefits when some tests fail. They just clutter test output with additional noise information,
which do not help in fixing failing tests most of the time.
While at it, improve errors generated at app/victoria-metrics tests, so they contain more useful information
when debugging failed tests.
This is a follow-up for a9525da8a4
### Describe Your Changes
Implement spellcheck command:
- add cspell configuration files
- dockerize spellchecking process
- add Makefile targets
This PR adds a standalone `make spellcheck` target to check `docs/*.md` files for spelling
errors. The target process is dockerized to be run in a separate npm environment.
Some `docs/` typo fixes also included.
### Checklist
The following checks are **mandatory**:
- [x] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
---------
Signed-off-by: Arkadii Yakovets <ark@victoriametrics.com>
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: hagen1778 <roman@victoriametrics.com>
### Describe Your Changes
Replace VM- with VictoriaMetrics in QuickStart
Keep the previous anchors for backward compatibility
### Checklist
The following checks are **mandatory**:
- [ x] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
---------
Signed-off-by: Artem Navoiev <tenmozes@gmail.com>
### Describe Your Changes
In most cases histograms are exposed in sorted manner with lower buckets
being first. This means that during scraping buckets with lower bounds
have higher chance of being updated earlier than upper ones.
Previously, values were propagated from upper to lower bounds, which
means that in most cases that would produce results higher than expected
once all buckets will become updated.
Propagating from upper bound effectively limits highest value of
histogram to the value of previous scrape. Once the data will become
consistent in the subsequent evaluation this causes spikes in the
result.
Changing propagation to be from lower to higher buckets reduces value
spikes in most cases due to nature of the original inconsistency.
See: https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4580
An example histogram with previous(red) and updated(blue) versions:

This also makes logic of filling nan values with lower buckets values: [1 2 3 nan nan nan] => [1 2 3 3 3 3] obsolete.
Since buckets are now fixed from lower ones to upper this happens in the main loop, so there is no need in a second one.
---------
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: Andrii Chubatiuk <andrew.chubatiuk@gmail.com>
Co-authored-by: hagen1778 <roman@victoriametrics.com>
vmbackup, vmrestore and vmbackupmanager use the same libs
for integrations with object storage. That means the auth can be configured
in the same way for all of them. So the docs should have either identical
config section for all 3 components, or we should cross-link to one source of truth.
This change removes incomplete auth options from vmrestore docs and adds link
to complete auth options in vmbackup instead.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
### Describe Your Changes
These changes support using Azure Managed Identity for the `vmbackup`
utility. It adds two new environment variables:
* `AZURE_USE_DEFAULT_CREDENTIAL`: Instructs the `vmbackup` utility to
build a connection using the [Azure Default
Credential](https://pkg.go.dev/github.com/Azure/azure-sdk-for-go/sdk/azidentity@v1.5.2#NewDefaultAzureCredential)
mode. This causes the Azure SDK to check for a variety of environment
variables to try and make a connection. By default, it tries to use
managed identity if that is set up.
This will close
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5984
### Checklist
The following checks are **mandatory**:
- [x] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
### Testing
However you normally test the `vmbackup` utility using Azure Blob should
continue to work without any changes. The set up for that is environment
specific and not listed out here.
Once regression testing has been done you can set up [Azure Managed
Identity](https://learn.microsoft.com/en-us/entra/identity/managed-identities-azure-resources/overview)
so your resource (AKS, VM, etc), can use that credential method. Once it
is set up, update your environment variables according to the updated
documentation.
I added unit tests to the `FS.Init` function, then made my changes, then
updated the unit tests to capture the new branches.
I tested this in our environment, but with SAS token auth and managed
identity and it works as expected.
---------
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
Co-authored-by: Justin Rush <jarush@epic.com>
Co-authored-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
Co-authored-by: hagen1778 <roman@victoriametrics.com>
'any' type is supported starting from Go1.18. Let's consistently use it
instead of 'interface{}' type across the code base, since `any` is easier to read than 'interface{}'.
This makes easier to read and debug these tests. This also reduces test lines count by 15% from 3K to 2.5K
See https://itnext.io/f-tests-as-a-replacement-for-table-driven-tests-in-go-8814a8b19e9e
While at it, consistently use t.Fatal* instead of t.Error*, since t.Error* usually leads
to more complicated and fragile tests, while it doesn't bring any practical benefits over t.Fatal*.
* restore old anchor names to keep links compatibility.
See https://docs.victoriametrics.com/#documentation requirements
* consistently use the same format for commands `sh` as it makes it better
renderred and automatically adds `copy` button to fileds with commands
* simplify the text by removing extra points in the list
* add recommendations for installing the cluster setup
* explicitly mention the ports services are listening on
* add description for `storageNode` cmd-line flag to inform the reader what
values need to be put into it
* fix the incorrect vmui link in cluster installation recommendation
* rename component anchors to be more unique, because URL doesn't respect
hierarchy for the anchored links and may result into conflicts in future
Signed-off-by: hagen1778 <roman@victoriametrics.com>
### Describe Your Changes
Updated Quickstart guide for VIctoriaMetrics and VictoriaMetrics Cluster to include instructions for installing the binaries by hand
### Checklist
The following checks are **mandatory**:
- [ x] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
### Describe Your Changes
Fixed Custom Model guide according to newer `vmanomaly` versions
### Checklist
The following checks are **mandatory**:
- [x] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
### Describe Your Changes
docs: fix typos
### Checklist
The following checks are **mandatory**:
- [x] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
### Describe Your Changes
Please provide a brief description of the changes you made. Be as
specific as possible to help others understand the purpose and impact of
your modifications.
### Checklist
The following checks are **mandatory**:
- [ ] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
### Describe Your Changes
This PR is aimed to change the currently in place configuration of
running Go related jobs for code changes that don't contain actual Go
files ([example
1](https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6517/checks)
- 2m32s , [example
2](https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6543/checks)
- 4m11s).
In order to do that the `build` workflow was extracted from Go related
workflow (now it doesn't require lint as a `need` step -- let me know if
it's something we want to keep). It will run upon the same triggers as
before the change.
The `main` workflow now will be triggered by `**.go` pattern only and
contains lint/test steps that are relevant for Go file changes.
I expect this PR +
https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6540 to improve
CI minutes usage.
### Checklist
The following checks are **mandatory**:
- [x] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
---------
Signed-off-by: Arkadii Yakovets <ark@victoriametrics.com>
Reason for revert:
There are many statsd servers exist:
- https://github.com/statsd/statsd - classical statsd server
- https://docs.datadoghq.com/developers/dogstatsd/ - statsd server from DataDog built into DatDog Agent ( https://docs.datadoghq.com/agent/ )
- https://github.com/avito-tech/bioyino - high-performance statsd server
- https://github.com/atlassian/gostatsd - statsd server in Go
- https://github.com/prometheus/statsd_exporter - statsd server, which exposes the aggregated data as Prometheus metrics
These servers can be used for efficient aggregating of statsd data and sending it to VictoriaMetrics
according to https://docs.victoriametrics.com/#how-to-send-data-from-graphite-compatible-agents-such-as-statsd (
the https://github.com/prometheus/statsd_exporter can be scraped as usual Prometheus target
according to https://docs.victoriametrics.com/#how-to-scrape-prometheus-exporters-such-as-node-exporter ).
Adding support for statsd data ingestion protocol into VictoriaMetrics makes sense only if it provides
significant advantages over the existing statsd servers, while has no significant drawbacks comparing
to existing statsd servers.
The main advantage of statsd server built into VictoriaMetrics and vmagent - getting rid of additional statsd server.
The main drawback is non-trivial and inconvenient streaming aggregation configs, which must be used for the ingested statsd metrics (
see https://docs.victoriametrics.com/stream-aggregation/ ). These configs are incompatible with the configs for standalone statsd servers.
So you need to manually translate configs of the used statsd server to stream aggregation configs when migrating
from standalone statsd server to statsd server built into VictoriaMetrics (or vmagent).
Another important drawback is that it is very easy to shoot yourself in the foot when using built-in statsd server
with the -statsd.disableAggregationEnforcement command-line flag or with improperly configured streaming aggregation.
In this case the ingested statsd metrics will be stored to VictoriaMetrics as is without any aggregation.
This may result in high CPU usage during data ingestion, high disk space usage for storing all the unaggregated
statsd metrics and high CPU usage during querying, since all the unaggregated metrics must be read, unpacked and processed
during querying.
P.S. Built-in statsd server can be added to VictoriaMetrics and vmagent after figuring out more ergonomic
specialized configuration for aggregating of statsd metrics. The main requirements for this configuration:
- easy to write, read and update (ideally it should work out of the box for most cases without additional configuration)
- hard to misconfigure (e.g. hard to shoot yourself in the foot)
It would be great if this configuration will be compatible with the configuration of the most widely used statsd server.
In the mean time it is recommended continue using external statsd server.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6265
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5053
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5052
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/206
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4600
This reverts commit 5a3abfa041.
Reason for revert: exemplars aren't in wide use because they have numerous issues which prevent their adoption (see below).
Adding support for examplars into VictoriaMetrics introduces non-trivial code changes. These code changes need to be supported forever
once the release of VictoriaMetrics with exemplar support is published. That's why I don't think this is a good feature despite
that the source code of the reverted commit has an excellent quality. See https://docs.victoriametrics.com/goals/ .
Issues with Prometheus exemplars:
- Prometheus still has only experimental support for exemplars after more than three years since they were introduced.
It stores exemplars in memory, so they are lost after Prometheus restart. This doesn't look like production-ready feature.
See 0a2f3b3794/content/docs/instrumenting/exposition_formats.md (L153-L159)
and https://prometheus.io/docs/prometheus/latest/feature_flags/#exemplars-storage
- It is very non-trivial to expose exemplars alongside metrics in your application, since the official Prometheus SDKs
for metrics' exposition ( https://prometheus.io/docs/instrumenting/clientlibs/ ) either have very hard-to-use API
for exposing histograms or do not have this API at all. For example, try figuring out how to expose exemplars
via https://pkg.go.dev/github.com/prometheus/client_golang@v1.19.1/prometheus .
- It looks like exemplars are supported for Histogram metric types only -
see https://pkg.go.dev/github.com/prometheus/client_golang@v1.19.1/prometheus#Timer.ObserveDurationWithExemplar .
Exemplars aren't supported for Counter, Gauge and Summary metric types.
- Grafana has very poor support for Prometheus exemplars. It looks like it supports exemplars only when the query
contains histogram_quantile() function. It queries exemplars via special Prometheus API -
https://prometheus.io/docs/prometheus/latest/querying/api/#querying-exemplars - (which is still marked as experimental, btw.)
and then displays all the returned exemplars on the graph as special dots. The issue is that this doesn't work
in production in most cases when the histogram_quantile() is calculated over thousands of histogram buckets
exposed by big number of application instances. Every histogram bucket may expose an exemplar on every timestamp shown on the graph.
This makes the graph unusable, since it is litterally filled with thousands of exemplar dots.
Neither Prometheus API nor Grafana doesn't provide the ability to filter out unneeded exemplars.
- Exemplars are usually connected to traces. While traces are good for some
I doubt exemplars will become production-ready in the near future because of the issues outlined above.
Alternative to exemplars:
Exemplars are marketed as a silver bullet for the correlation between metrics, traces and logs -
just click the exemplar dot on some graph in Grafana and instantly see the corresponding trace or log entry!
This doesn't work as expected in production as shown above. Are there better solutions, which work in production?
Yes - just use time-based and label-based correlation between metrics, traces and logs. Assign the same `job`
and `instance` labels to metrics, logs and traces, so you can quickly find the needed trace or log entry
by these labes on the time range with the anomaly on metrics' graph.
- Export streamaggr.LoadFromData() function, so it could be used in tests outside the lib/streamaggr package.
This allows removing a hack with creation of temporary files at TestRemoteWriteContext_TryPush_ImmutableTimeseries.
- Move common code for mustParsePromMetrics() function into lib/prompbmarshal package,
so it could be used in tests for building []prompbmarshal.TimeSeries from string.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/6205
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6206
Move the code responsible for relabelCtx clearing into deferred function.
This allows making more clear the remoteWriteCtx.TryPush code.
This is a follow-up for 879771808b
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/6205
While at it, clarify the description of the bugfix at docs/CHANGELOG.md
Use metricsql.IsLikelyInvalid() function for determining whether the given query is likely invalid,
e.g. there is high change the query is incorrectly written, so it will return unexpected results.
The query is invalid most of the time if it passes something other than series selector into rollup function.
For example:
- rate(sum(foo))
- rate(foo + bar)
- rate(foo > bar)
Improtant note: the query is considered valid if it misses the lookbehind window in square brackes inside rollup function,
e.g. rate(foo), since this is very convenient MetricsQL extention to PromQL, and this query returns the expected results
most of the time.
Other unsafe query types can be added in the future into metricsql.IsLikelyInvalid().
TODO: probably, the -search.disableImplicitConversion command-line flag must be set by default in the future releases of VictoriaMetrics.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4338
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6180
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6450
- Clarify docs for `Ignore aggregation intervals on start` feature.
- Make more clear the code dealing with ignoreFirstIntervals at aggregator.runFlusher() functions.
It is better from readability and maintainability PoV using distinct a.flush() calls
for distinct cases instead of merging them into a single a.flush() call.
- Take into account the first incomplete interval when tracking the number of skipped aggregation intervals,
since this behaviour is easier to understand by the end users.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6137
Previously the in-memory buffer could remain unflushed for long periods of time under low ingestion rate.
The ingested logs weren't visible for search during this time.
### Describe Your Changes
Removed snap packages support as it requires time for maintenance and
it's not popular at all
### Checklist
The following checks are **mandatory**:
- [ ] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: hagen1778 <roman@victoriametrics.com>
### Describe Your Changes
- added stale metrics counters for input and output samples
- added labels for aggregator metrics =>
`name="{rwctx}:{aggrId}:{aggrSuffix}"`
- rwctx - global or number starting from 1
- aggrid - aggregator id starting from 1
- aggrSuffix - <interval>_(by|without)_label1_label2_labeln
e.g: `name="global:1:1m_without_instance_pod"`
### Checklist
The following checks are **mandatory**:
- [ ] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: hagen1778 <roman@victoriametrics.com>
The set of log fields in the found logs may differ from the set of log fields present in the log stream.
So compare only the log fields in the found logs when searching for the matching log entry in the log stream.
While at it, return _stream field in the delimiter log entry, since this field is used by VictoriaLogs Web UI
for grouping logs by log streams.
Fix typo
### Describe Your Changes
Fix typo in the vmalert docs. In the docs it states rules at `VMAgent`'s
namespace when it should be `VMAlert`'s namespace.
### Checklist
The following checks are **mandatory**:
- [x] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
### Describe Your Changes
Small improvements to a QuickStart guide of `vmanomaly`
### Checklist
The following checks are **mandatory**:
- [x] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
### Describe Your Changes
Fixes#6453
### Checklist
The following checks are **mandatory**:
- [ ] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
### Describe Your Changes
Fix Date metricid cache consistency under concurrent use.
When one goroutine calls Has() and does not find the cache entry in the
immutable map it will acquire a lock and check the mutable map. And it
is possible that before that lock is acquired, the entry is moved from
the mutable map to the immutable map by another goroutine causing a
cache miss.
The fix is to check the immutable map again once the lock is acquired.
### Checklist
The following checks are **mandatory**:
- [x ] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
---------
Signed-off-by: Artem Fetishev <wwctrsrx@gmail.com>
Co-authored-by: Nikolay <nik@victoriametrics.com>
### Describe Your Changes
- Fixed the update of the relative time range when `Execute Query` is
clicked
- Optimized server requests: now, if an error occurs in the `/query`
request, the `/hits` request will not be executed.
#6345 (duplicates: #6440, #6312)
Updating documentation around the opentelemetry endpoint for metrics and
the "How to use OpenTelemetry metrics with VictoriaMetrics" guide so
that it shows not only how to directly write but also how to write to
the otel collector and view metrics in vmui.
---------
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
Co-authored-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
- Fixed config example in QuickStart vmanomaly docs for 1.13 version
compatibility
- [x] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
The `make clean-checkers` command may be useful when the locally installed checker apps are too old,
so they must be updated to the newly requested versions by `make check-all`. In this case the fix is to run
make clean-checkers check-all
The TryParseTimestampRFC3339Nano() must properly parse RFC3339 timestamps with timezone offsets.
While at it, make tryParseTimestampISO8601 function private in order to prevent
from improper usage of this function from outside the lib/logstorage package.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/6508
rm custom scripts for downloading Grafana plugins for
VictoriaMetrics and VictoriaLogs. Use `GF_INSTALL_PLUGINS` instead.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
This reverts commit 6e395048d3.
Reason for revert: the previous logic was correct.
The purpose of `-search.maxSamplesPerQuery` command-line flag is to limit the amounts of CPU resources,
which could be taken by a single query - see https://docs.victoriametrics.com/#resource-usage-limits .
VictoriaMetrics processes samples in blocks during querying - it reads the block, then unpacks it,
then filters out samples outside the selected time range. This means that it _spends CPU time_
on reading and unpacking of _all the samples_ in every block on the requested time range,
even if only a single sample per each block matches the given time range.
The previous logic was effectively limiting CPU time a single query could take.
The new logic fails limiting CPU time a single query could take in some pathological cases
when only a small fraction of samples per each requested block fit the requested time range.
This allows performing multiplication DoS-attacks by querying very narrow time ranges over historical blocks,
which tend to be full. For example, if the `-search.maxSamplesPerQuery` equals to a billion,
and the query requests a single sample out of 8K samples per each block, this means that the query
may unpack a billion of such blocks without exceeding the limit, e.g. it may unpack and process 8K*1e9=8e12 samples.
This is not what the resource usage limits were created for originally - see https://docs.victoriametrics.com/#resource-usage-limits
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5851
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6464
This reverts commit 92b22581e6.
Reason for revert: too complex and slow approach for spellchecking task.
This approach may significantly slow down development pace. It also may take non-trivial
amounts of additional time and resources at CI/CD because of all this npm shit at cspell directory.
Note to @arkid15r : the idea with the ability to run spellchecker on all the VictoriaMetrics
codebase is great. But this shouldn't be mandatory pre-commit check. It is enough to have
a Makefile rule like `make spellcheck`, which could be run manually whenever spellcheck is needed
(e.g. once per month or once per quarter).
Reason for revert: this commit doesn't resolve real security issues,
while it complicates the resulting code in subtle ways (aka security circus).
Comparison of two strings (passwords, auth keys) takes a few nanoseconds.
This comparison is performed in non-trivial http handler, which takes thousands
of nanoseconds, and the request handler timing is non-deterministic because of Go runtime,
Go GC and other concurrently executed goroutines. The request handler timing is even
more non-deterministic when the application is executed in shared environments
such as Kubernetes, where many other applications may run on the same host and use
shared resources of this host (CPU, RAM bandwidth, network bandwidth).
Additionally, it is expected that the passwords and auth keys are passed via TLS-encrypted connections.
Establishing TLS connections takes additional non-trivial time (millions of nanoseconds),
which depends on many factors such as network latency, network congestion, etc.
This makes impossible to conduct timing attack on passwords and auth keys in VictoriaMetrics components.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6423/files
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/6392
mention change for
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/6457
### Describe Your Changes
Please provide a brief description of the changes you made. Be as
specific as possible to help others understand the purpose and impact of
your modifications.
### Checklist
The following checks are **mandatory**:
- [ ] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
### Describe Your Changes
With the recent release of Managed VictoriaMetrics users are able to
create and execute Alerting & Recording rules and send notifications via
hosted Alertmanager.
So, we're publishing Alertmanager configuration docs for Managed
VictoriaMetrics.
### Checklist
The following checks are **mandatory**:
- [x] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
* rm extra interface method for rw Client, as it has low applicability
and doesn't fit multitenancy well
* add `GetDroppedRows` method instead
Signed-off-by: hagen1778 <roman@victoriametrics.com>
### Describe Your Changes
Fixed a small typo in a comment about the mutex inside the FastQueue
struct
### Checklist
The following checks are **mandatory**:
- [x] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
### Describe Your Changes
Trimming content which is loaded from an external pass leads to obscure
issues in case user-defined input contained trimmed chars. For example.
user-defined password "foo\n" will become "foo" while user will expect
it to contain a new line.
---
For example, a user defines a password which ends with `\n`. This often
happens when user Kubernetes secrets and manually encodes value as
base64-encoded string.
In this case vmauth configuration might look like:
```
users:
- url_prefix:
- http://vminsert:8480/insert/0/prometheus/api/v1/write
name: foo
username: foo
password: "foobar\n"
```
vmagent configuration for this setup will use the following flags:
```
-remoteWrite.url=http://vmauth:8427/
-remoteWrite.basicAuth.passwordFile=/tmp/vmagent-password
-remoteWrite.basicAuth.username="foo"
```
Where `/tmp/vmagent-password` is a file with `foobar\n` password.
Before this change such configuration will result in `401 Unauthorized`
response received by vmagent since after file content will become
`foobar`.
---
An example with Kubernetes operator which uses a secret to reference the
same password in multiple configurations.
<details>
<summary>See full manifests</summary>
`Secret`:
```
apiVersion: v1
data:
name: Zm9v # foo
password: Zm9vYmFy # foobar\n
username: Zm9v= # foo
kind: Secret
metadata:
name: vmuser
```
`VMUser`:
```
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMUser
metadata:
name: vmagents
spec:
generatePassword: false
name: vmagents
targetRefs:
- crd:
kind: VMAgent
name: some-other-agent
namespace: example
username: foo
# note - the secret above is referenced to provide password
passwordRef:
name: vmagent
key: password
```
`VMAgent`:
```
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMAgent
metadata:
name: example
spec:
selectAllByDefault: true
scrapeInterval: 5s
replicaCount: 1
remoteWrite:
- url: "http://vmauth-vmauth-example:8427/api/v1/write"
# note - the secret above is referenced as well
basicAuth:
username:
name: vmagent
key: username
password:
name: vmagent
key: password
```
</details>
Since both config target exactly the same `Secret` object it is expected
to work, but apparently the result will be `401 Unauthrized` error.
### Checklist
The following checks are **mandatory**:
- [x] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
---------
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: hagen1778 <roman@victoriametrics.com>
- Added a bar chart displaying the number of log entries over a time
range.
#6404
- When `_msg` is empty, all fields are displayed in a single line.
- Added double quotes when copying pairs: `key: "value"`.
- Minor style adjustments.
Check for ranged vector arguments in aggregate expressions when
`-search.disableImplicitConversion` or `-search.logImplicitConversion`
are enabled.
For example, `sum(up[5m])` will fail to execute if these flags are set.
### Describe Your Changes
Please provide a brief description of the changes you made. Be as
specific as possible to help others understand the purpose and impact of
your modifications.
### Checklist
The following checks are **mandatory**:
- [*] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
### Describe Your Changes
Added config example for `min_dev_from_expected` arg; also, small
styling fixes and alignments
### Checklist
The following checks are **mandatory**:
- [ ] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
### Describe Your Changes
Please provide a brief description of the changes you made. Be as
specific as possible to help others understand the purpose and impact of
your modifications.
### Checklist
The following checks are **mandatory**:
- [x] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
### Describe Your Changes
small fix of typos in v1.13 presets (vmanomaly docs)
### Checklist
The following checks are **mandatory**:
- [x] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
### Describe Your Changes
* check if `lastValue` was seen at least twice with different
timestamps. Otherwise, the difference between last timestamp and
previous timestamp could be `0` and will result into `NaN` calculation
* check if there items left in lastValue map after staleness cleanup.
Otherwise, `rate_avg` could have produce `NaN` result.
### Checklist
The following checks are **mandatory**:
- [x] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Previously byte slices up to 2^20 bytes (e.g. 1Mb) were cached because of a typo in the commit c14dafce43 .
This could result in increased memory usage when vmagent scrapes many regular targets, which expose
relatively small number of metrics (e.g. up to a few thousand per target) and a few large targets such as kube-state-metrics,
which expose more than 10 thousand metrics. This is common case for Kubernetes monitoring.
While at it, remove pools for very small byte slices, since they are rarely used during scraping.
- Make it in a separate goroutine, so it doesn't slow down regular intern() calls.
- Do not lock internStringMap.mutableLock during the cleanup routine, since now
it is called from a single goroutine and reads only the readonly part of the internStringMap.
This should prevent from locking regular intern() calls for new strings during cleanups.
- Add jitter to the cleanup interval in order to prevent from synchornous increase in resource usage
during cleanups.
- Run the cleanup twice per -internStringCacheExpireDuration . This should save 30% CPU time spent
on cleanup comparing to the previous code, which was running the cleanup 3 times per -internStringCacheExpireDuration .
This change fixes the following panic:
```
2024-06-04T11:16:52.899Z warn app/vmauth/auth_config.go:353 cannot discover backend SRV records for http://srv+localhost:8080: lookup localhost on 10.100.10.4:53: server misbehaving; use it literally
panic: runtime error: integer divide by zero
goroutine 9 [running]:
github.com/VictoriaMetrics/VictoriaMetrics/lib/httpserver.handlerWrapper.func1()
/Users/lhhdz/wd/projects/go/VictoriaMetrics/lib/httpserver/httpserver.go:291 +0x58
panic({0x103115100?, 0x10338d700?})
/Users/lhhdz/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.22.3.darwin-arm64/src/runtime/panic.go:770 +0x124
main.getLeastLoadedBackendURL({0x0?, 0x22?, 0x1400014757b?}, 0x1400013c120?)
/Users/lhhdz/wd/projects/go/VictoriaMetrics/app/vmauth/auth_config.go:473 +0x210
main.(*URLPrefix).getBackendURL(0x140000aa080)
/Users/lhhdz/wd/projects/go/VictoriaMetrics/app/vmauth/auth_config.go:312 +0xb8
```
---------
Co-authored-by: Haley Wang <haley@victoriametrics.com>
### Describe Your Changes
Fix 404 relative img links for v1.13.0 update of vmanomaly docs
### Checklist
The following checks are **mandatory**:
- [x] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
### Describe Your Changes
Corrected spelling mistake in the operator json to be "working" instead
of "wokring"
### Checklist
The following checks are **mandatory**:
- [ x ] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
### Describe Your Changes
value→valid
### Checklist
The following checks are **mandatory**:
- [x] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
Signed-off-by: Lapo Luchini <lapo@lapo.it>
This should reduce GC overhead when tens of millions of strings are interned (for example, during stream deduplication
of millions of active time series).
Add support for markdown format and emoji for the `_msg` field in the
"Group" view.
Add markdown rendering toggle. Disabled by default. Value is stored in
`localStorage`.
These functions are called every time `/metrics` page is scraped, so it would be great
if they could be sped up for the cases when dedupAggr tracks tens of millions of active time series.
* adds idleConnTimeout flag, which must reduce probability of `broken
pipe` and `connection reset` errors.
* one-time retry trivial network requests for the same backend
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: hagen1778 <roman@victoriametrics.com>
- Use bytesutil.InternString() instead of strings.Clone() for inputKey and outputKey in aggregatorpushSamples().
This should reduce string allocation rate, since strings can be re-used between aggrState flushes.
- Reduce memory allocations at dedupAggrShard by storing dedupAggrSample by value in the active series map.
- Remove duplicate call to bytesutil.InternBytes() at Deduplicator, since it is already called inside dedupAggr.pushSamples().
- Add missing string interning at rateAggrState.pushSamples().
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6402
The main change is getting rid of interning of sample key. It was
discovered that for cases with many unique time series aggregated by
vmagent interned keys could grow up to hundreds of millions of objects.
This has negative impact on the following aspects:
1. It slows down garbage collection cycles, as GC has to scan all inuse
objects periodically. The higher is the number of inuse objects, the
longer it takes/the more CPU it takes.
2. It slows down the hot path of samples aggregation where each key
needs to be looked up in the map first.
The change makes code more fragile, but suppose to provide performance
optimization for heavy-loaded vmagents with stream aggregation enabled.
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
### Describe Your Changes
Added streamaggr metrics to:
- `vm_streamaggr_samples_lag_seconds` - samples lag
- `vm_streamaggr_ignored_samples_total{reason="nan"}` - ignored NaN
samples
- `vm_streamaggr_ignored_samples_total{reason="too_old"}` - ignored old
samples
Unsupported path is already handled by `lib/httpserver`.
This prevents from misleading errors in logs caused by double-writing response headers.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
### Describe Your Changes
Scratch based images will be using a separate tag: "(version)-scratch"
and will be built for the same architecture as regular images.
This is useful for environments with higher security standards. In this
case using alpine as base layer requires updating images more frequently
in order to get the latest updates for the base image, even in case the
user did not need to update VictoriaMetrics version.
Tested that scratch images work for:
- vmagent - enterprise with kafka and opensource
- cluster
- single-node
No issues observed so far.
cc: @tenmozes
### Checklist
The following checks are **mandatory**:
- [x] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
---------
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: hagen1778 <roman@victoriametrics.com>
### Describe Your Changes
Please provide a brief description of the changes you made. Be as
specific as possible to help others understand the purpose and impact of
your modifications.
### Checklist
The following checks are **mandatory**:
- [x] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
Signed-off-by: Artem Navoiev <tenmozes@gmail.com>
* "*.idleConnTimeout" flags must reduce probability of `write: broken
pipe` and `read: connection reset by peer` errors Those errors may occur
if remote server closes TCP socket for connection, while it's still
exist at client.
* single time retries for `write: broken pipe` and `read: connection
reset by peer` must handle a case for incorrectly configured timeouts at
middleware proxies, mitigate minor network issues.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5661
### Describe Your Changes
Please provide a brief description of the changes you made. Be as
specific as possible to help others understand the purpose and impact of
your modifications.
---------
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
Update wording to highlight that cache is not persistent if flag is
value is empty. Previously, it was not clear if cache is not used at all
or just not persistent.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* It must reduce memory usage for misbehaving clients. Since
VictoriaMetrics stores sparse index inmemory.
* Reduce disk space usage for indexdb.
* Prevent possible indexDB items drops.
* It may trigger slow insert and new timeseries registration due to
default value for flag change
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/6176
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
This PR fixes an issue where parsing long `_msg` values caused errors,
resulting in some log records not being displayed.
The error occurred due to partial processing of strings. In some cases,
a long record could be split into multiple chunks, causing only part of
the record to be processed instead of the entire entry.
#6281
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
This should improve precision of `restarts` and `version change` annotations when
zooming-in/zooming-out on the dashboards.
The change also makes `restarts` dashboard visible on the panels, so user can disable it from
displaying if needed. This could be useful when restarts overlap with version change events.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Prevent excessive resource usage when stream aggregation config file
contains no matchers by prevent pushing data into Aggregators object.
Before this change a lot of extra work was invoked without reason.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Occasionally, vmagent sends empty blocks to downstream servers. If a
downstream server returns an unexpected response, vmagent gets stuck in
a retry loop. While vmagent handles 400 and 409 errors, there are
various prometheus remote write implementations that return different
error codes. For example, vector returns a 422 error. To mitigate the
risk of vmagent getting stuck in a retry loop, it is advisable to skip
sending empty blocks to downstream servers.
Co-authored-by: hao.peng <hao.peng@smartx.com>
Co-authored-by: Zhu Jiekun <jiekun.dev@gmail.com>
Co-authored-by: hagen1778 <roman@victoriametrics.com>
* adds datadog extensions for statsd:
- multiple packed values (v1.1)
- additional types distribution, histogram
* adds type check and append metric type to the labels with special tag
name `__statsd_metric_type__`. It simplifies streaming aggregation
config.
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: hagen1778 <roman@victoriametrics.com>
Allocations are reduced by implementing custom json parser via fastjson
lib.
The change also re-uses `promInstant` object in attempt to reduce number
of
allocations when parsing big responses, as usually happens with heavy
recording rules.
```
name old allocs/op new allocs/op delta
ParsePrometheusResponse/Instant-10 9.65k ± 0% 5.60k ± 0% ~ (p=1.000 n=1+1)
```
Signed-off-by: hagen1778 <roman@victoriametrics.com>
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Allocations are reduced by re-using the byte buffer when converting
labels to string keys.
```
name old allocs/op new allocs/op delta
GetStaleSeries-10 703 ± 0% 203 ± 0% ~ (p=1.000 n=1+1)
```
Signed-off-by: hagen1778 <roman@victoriametrics.com>
it's needed to remove Summary metric type from the global state of
metrics package. metrics package tracks each bucket of summary and
periodically swaps old buckets with new.
Simple set unregister is not enough to release memory used by Set
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/6247
Change the return values for these functions - now they return the unmarshaled result plus
the size of the unmarshaled result in bytes, so the caller could re-slice the src for further unmarshaling.
This improves performance of these functions in hot loops of VictoriaLogs a bit.
Added `rate` and `rate_avg` output
Resource usage is the same as for increase output, tested on a benchmark
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: hagen1778 <roman@victoriametrics.com>
### Describe Your Changes
Added makefile rule for `GOARCH=loong64` to support building all
VictoriaMetrics components on the `loongarch64` platform.
### Checklist
The following checks are **mandatory**:
* [X] My change adheres [VictoriaMetrics contributing
guidelines](https://docs.victoriametrics.com/contributing/).
Signed-off-by: qiangxuhui <qiangxuhui@loongson.cn>
Set correct suffix `<output>_prometheus` for aggregation outputs
`increase_prometheus` and `total_prometheus`
Before, outputs `total` and `total_prometheus` or `increase` and
`increase_prometheus` had the same suffix.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Though labels compressor is quite resource intensive, each aggregator
and deduplicator instance has it's own compressor. Made it shared across
all aggregators to consume less resources while using multiple
aggregators.
Co-authored-by: Roman Khavronenko <hagen1778@gmail.com>
### Describe Your Changes
related issue:
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/6041
#### Added
- Added service discovery support for Vultr.
#### Docs
- `CHANGELOG.md`, `sd_configs.md`, `vmagent.md` are updated.
#### Note
- Useful links:
- Vultr API:
https://www.vultr.com/api/#tag/instances/operation/list-instances
- Vultr client SDK: https://github.com/vultr/govultr
- Prometheus SD:
https://github.com/prometheus/prometheus/tree/main/discovery/vultr
---
### Checklist
The following checks are mandatory:
- [X] I have read the [Contributing
Guidelines](https://github.com/VictoriaMetrics/VictoriaMetrics/blob/master/CONTRIBUTING.md)
- [x] All commits are signed and include `Signed-off-by` line. Use `git
commit -s` to include `Signed-off-by` your commits. See this
[doc](https://git-scm.com/book/en/v2/Git-Tools-Signing-Your-Work) about
how to sign your commits.
- [x] Tests are passing locally. Use `make test` to run all tests
locally.
- [x] Linting is passing locally. Use `make check-all` to run all
linters locally.
Further checks are optional for External Contributions:
- [X] Include a link to the GitHub issue in the commit message, if issue
exists.
- [x] Mention the change in the
[Changelog](https://github.com/VictoriaMetrics/VictoriaMetrics/blob/master/docs/CHANGELOG.md).
Explain what has changed and why. If there is a related issue or
documentation change - link them as well.
Tips for writing a good changelog message::
* Write a human-readable changelog message that describes the problem
and solution.
* Include a link to the issue or pull request in your changelog message.
* Use specific language identifying the fix, such as an error message,
metric name, or flag name.
* Provide a link to the relevant documentation for any new features you
add or modify.
- [ ] After your pull request is merged, please add a message to the
issue with instructions for how to test the fix or try the feature you
added. Here is an
[example](https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4048#issuecomment-1546453726)
- [x] Do not close the original issue before the change is released.
Please note, in some cases Github can automatically close the issue once
PR is merged. Re-open the issue in such case.
- [x] If the change somehow affects public interfaces (a new flag was
added or updated, or some behavior has changed) - add the corresponding
change to documentation.
Signed-off-by: Jiekun <jiekun.dev@gmail.com>
This code adds Exemplars to VMagent and the promscrape parser adhering
to OpenMetrics Specifications. This will allow forwarding of exemplars
to Prometheus and other third party apps that support OpenMetrics specs.
---------
Signed-off-by: Ted Possible <ted_possible@cable.comcast.com>
The panel will show how many ongoing select queries are processed by vmstorage
and should help to identify resource bottlenecks. See panel description for more details.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
The new flag type is supposed to be used for specifying URL values which
could contain sensitive information such as auth tokens in GET params or
HTTP basic authentication.
The URL flag also allows loading its value from files if `file://`
prefix is specified. As example, the new flag type was used in
app/vmbackup as it requires specifying `authKey` param for making the
snapshot.
See related issue
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5973
Thanks to @wasim-nihal for initial implementation
https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6060
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
The cumulative number of active merges could be red herring
as it its value depends on the number of vmstorages.
For example, vmstorage could be added or removed and this will affect
the panel.
Or, each vmstorage could start a merging process (i.e. for downsampling)
and visiually it could look like a massive change.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
See https://github.com/VictoriaMetrics/VictoriaMetrics/issues/6140
We don't cover this corner case as it has low chance for reproduction.
Precisely, the requirements are following:
1. vmagent need to be configured with multiple identical `remoteWrite.url` flags;
2. At least one of the persistent queues need to be non-empty, which already
signalizes about issues with setup;
3. vmagent need to be restarted with removing of one of `remoteWrite.url` flags.
We do not document this case in vmagent.md as it seems to be a rare corner case
and its explanation will require too much of explanation and confuse users.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Starting from v1.99.0 vmalert could ignore anchors pointing to specific
rule groups if `search` param was present in URL.
This change makes anchors compatible with `search` param in UI.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Using `sh` or `console` formatting doesn't do word-breaking on render. This makes flags description
harder to read, as users need to scroll the web page horizontally.
Removing the formatting renders the description with normal word-breaking.
Stream aggregation may yield inaccurate results if it processes incomplete data.
This issue can arise when data is sourced from clients that maintain a queue of unsent data, such as Prometheus or vmagent.
If the queue isn't fully cleared within the aggregation interval, only a portion of the time series may be included in that period, leading to distorted calculations.
To mitigate this we add an option to ignore first N aggregation intervals. It is expected, that client queues
will be cleared during the time while aggregation ignores first N intervals and all subsequent aggregations
will be correct.
It is better from visibility PoV if the CONTRIBUTING.md file is visible at https://docs.victoriametrics.com/contributing/ .
This page can be indexed by search engines and searched by our users later.
It is also easier to provide a link to this page now comparing to the old link, which is much harder to remember:
https://github.com/VictoriaMetrics/VictoriaMetrics/blob/master/CONTRIBUTING.md
While at it, syncrhonize docs/CONTRIBUTING.md with .github/PULL_REQUEST_TEMPLATE/pull_request_template.md ,
so they do not contain duplicate information, which can be outdated over time. Now all the relevant information
is located at docs/CONTRIBUTING.md, while .github/PULL_REQUEST_TEMPLATE/pull_request_template.md refers this doc.
This is a follow-up for c006db1798
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6040
Using plain sync.Pool simplifies the code without increasing memory usage and CPU usage.
So it is better to use plain sync.Pool from readability and maintainability PoV.
This is a follow-up for 8942f290eb
The memory usage for plain sync.Pool doesn't increase comparing to the memory usage for the hybrid scheme,
so it is better to use plain sync.Pool in order to simplify the code and make it more readable and maintainable.
This is a follow-up for c22da2f917
Data ingestion benchmark doesn't show memory usage difference between two approaches,
so let's use simpler approach in order to improve code readability and maintainability.
This is a follow-up for 77c597738c
This scheme was used for reducing memory usage when vmagent runs on a machine with big number of CPU cores
and the ingestion rate isn't too big. The scheme with channel-based pool could reduce memory usage,
since it minimizes the number of PushCtx structs in the pool in this case.
Performance tests didn't reveal significant difference in memory usage under both low and high ingestion rate
between plain sync.Pool and the current hybrid scheme, so replace the scheme with plain sync.Pool in order
to simplify the code.
There was a sleep statement in the test, waiting for Group
to perform a couple of evaluation. But looks like
it worked unreliable for some CI tests like the one below
https://github.com/VictoriaMetrics/VictoriaMetrics/actions/runs/8718213844/job/23915007958?pr=6115
This commit changes the sleep statement on a function that
waits for a specific number of evaluations. It should make this
test faster in general case, and more reliable for slow environemnts.
It is incorrect applying the limit on the number of values to search without applying filters,
since the returned subset of label values may miss the label values matching the given filters.
This is a follow-up for 66630c7960
This speeds up auto-suggestion for metric names in VMUI and Grafana, which use the following query in this case:
/api/v1/label/__name__/values?match[]={__name__=~"*.some_value.*"}
When the user types `some_value` in the query input field.
This simplifies further maintenance and opens doors for additional config options
supported by lib/promauth. For example, an ability to specify client TLS certificates.
- Use exact matching by default for the query arg value provided via arg=value syntax at src_query_args.
Regex matching can be enabled by using =~ instead of = . For example, arg=~regex.
This ensures that the exact matching works as expected without the need to escape special regex chars.
- Add helper functions for creating QueryArg, Header and Regex structs in tests.
This improves maintainability of the tests.
- Remove url.QueryUnescape() call on the url in TestCreateTargetURLSuccess(), since this is bogus approach.
The url.QueryUnescape() must be applied to individual query args, and it mustn't be applied to the whole url,
since in this case it may perform invalid unescaping in the context of the url, or make the resulting url invalid.
While at it, properly marshal all the fields inside UserInfo config to yaml in tests.
Previously Header and QueryArg structs were improperly marshaled because the custom MarshalYAML
is called only on pointers to Header and QueryArg structs. This improves test coverage.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/6070
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6115
If one out of 5 vmstorage nodes is unavailable, then the remaining 4 vmstorage nodes will recieve 1/5=20% increase of workload, not 5%.
If one out of 10 vmstorage nodes is unavailable, then the remaining 9 vmstorage nodes will receive 1/10=10% increase of workload, not 1%.
This is a follow-up for 458338afa5
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6099
* app/vmauth: do not increment backend_errors when hitting concurrency limit
Previously, both "vmauth_concurrent_requests_limit_reached_total" and "vmauth_user_request_backend_errors_total" were incremented.
This was based on the assumption that if concurrency limit is hit the backend must be failing to handle the request thus meaning an error.
This assumption does not work in case the endpoint can be overloaded by the misbehaving client sending too many requests within the timeframe.
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5565
Signed-off-by: hagen1778 <roman@victoriametrics.com>
---------
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: hagen1778 <roman@victoriametrics.com>
This should improve debuggability of unexpected deletion of directories inside partitions.
While at it, log the proper path to parts.json when the directory for big part is missing in the partition.
parts.json is located inside directory with small parts, and there is no parts.json file inside directory with big parts.
* docs/cluster: update cluster resizing info
- add example of resources distribution
- add info about how to handle uneven disk usage after adding a new storage node
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* Update docs/Cluster-VictoriaMetrics.md
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
* Update docs/Cluster-VictoriaMetrics.md
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
---------
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
* vmui: fix parsing of fractional values
* vmui/vmanomaly: update display logic to align with vmanomaly /query_range API
* vmui/vmanomaly: rename flag anomalyView to isAnomalyView
- Automatically reload changed TLS root CA pointed by -remoteWrite.tlsCAFile command-line flag
- Automatically reload changed TLS root CA configured via oauth2.tsl_config.ca_file option at -promscrape.config
- Document the change as a feature instead of a bug at docs/CHANGELOG.md
- Simplify the code at lib/promauth, which is responsible for reloading changed TLS root CA files.
- Simplify the usage of lib/promauth.Config.NewRoundTripper() - now it accepts the base http.Transport
instead of a callback, which can change the internal http.Transport.
- Reuse the default tls config if lib/promauth.Config doesn't contain tls-specific configs.
This should reduce memory usage a bit when tls isn't used for scraping big number of targets.
- Do not re-read TLS root CA files on every processed request. Re-read them once per second.
This should reduce CPU usage when scraping big number of targets over https.
- Do not store cert.pem and key.pem files in TestTLSConfigWithCertificatesFilesUpdate, since they can be loaded
from byte slices via crypto/tls.X509KeyPair().
- Remove obsolete comparisons of string representations for authConfig and proxyAuthConfig at areEqualScrapeConfigs().
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5725
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5526
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2171
* lib/{promauth,promscrape}: automatically refresh root CA certificates after changes on disk
Added a custom `http.RoundTripper` implementation which checks for root CA content changes and updates `tls.Config` used by `http.RoundTripper` after detecting CA change.
Client certificate changes are not tracked by this implementation since `tls.Config` already supports passing certificate dynamically by overriding `tls.Config.GetClientCertificate`.
This change implements dynamic reload of root CA only for streaming client used for scraping. Blocking client (`fasthttp.HostClient`) does not support using custom transport so can't use this implementation.
See: https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5526
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* lib/promauth/config: update NewRoundTripper API
Update API to allow user to update only parameters required for transport.
Add warning log when reloading Root CA failed.
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* lib/promauth/config: fix mutex acquire logic
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* lib/promauth/config: replace RWMutex with regular mutex to simplify the code
- remove additional mutex used for getRootCABytes - require callee to use mutex
- replace RWMutex with regular mutex
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* lib/promauth/config: refactor
- hold the mutex lock to avoid round tripper being re-created twice
- move recreation logic into separate func to simplify the code
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
---------
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
Co-authored-by: Nikolay <nik@victoriametrics.com>
- Rename -opentelemetry.sanitizeMetrics command-line flag to more clear -opentelemetry.usePrometheusNaming
- Clarify the description of the change at docs/CHANGELOG.md
- Rename promrelabel.SanitizeLabelNameParts to more clear promrelabel.SplitMetricNameToTokens
- Properly split metric names at '_' char in promerlabel.SplitMetricNameToTokens.
- Add tests for various edge cases for Prometheus metric names' normalization
according to the code at b865505850/pkg/translator/prometheus/normalize_name.go
- Extract the code responsible for Prometheus metric names' normalization into a separate file (santize.go)
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/6037
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6035
- Make the configuration more clear by accepting the list of ignored labels during sharding
via a dedicated command-line flag - -remoteWrite.shardByURL.ignoreLabels.
This prevents from overloading the meaning of -remoteWrite.shardByURL.labels command-line flag.
- Removed superfluous memory allocation per each processed sample if sharding by remote storage is enabled.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5938
Remove description for -search.maxExportDuration and -search.maxStatusRequestDuration command-line flags
from the 'Resource usage limits' chapter, since these flags are rarely used for limiting resource usage
and they are already documented in the 'List of command-line flags' chapter.
- Fix docs for new functions at app/vmselect/graphite/functions.json
- Properly drain series lists on errors in aggregateSeriesListsGeneric() and aggregateSeriesList()
- Add links to docs for the added functions at docs/CHANGELOG.md
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5809
- Allow specifying only a single HTTP header for reading auth tokens via -httpAuthHeader command-line flag.
This is better from security PoV, since this prevents from accidental reading of auth token from undesired
HTTP header. By default the -httpAuthHeader equals to Authorization. When it is overridden, then
auth token isn't read from Authorization header - it is read only from the specified header.
- Document the -httpAuthHeader command-line flag at https://docs.victoriametrics.com/vmauth/#reading-auth-tokens-from-other-http-headers
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6009
* docs/vmanomaly: v1.12.0 & link updates
* add autotuned description to model section
* - update refs of vmanomaly on enterprise and vmalert pages
- add diagrams for model types
- update self-monitoring section
* - fix typos
- remove .index.html from links
This reverts commit cb23685681.
Reason for revert: the "fix" may hide programming bugs related to incorrect creation of folders
before their use. This may complicate detecting and fixing such bugs in the future.
There are the following fixes for the issue https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5985 :
- To configure the OS to do not drop data from the system-wide temporary directory (aka /tmp).
- To run VictoriaMetrics with -cacheDataPath command-line flag, which points to the directory,
which cannot be removed automatically by the OS.
The case when the user accidentally deletes the directory with some files created by VictoriaMetrics
shouldn't be considered as expected, so VictoriaMetrics shouldn't try resolving this case automatically.
It is much better from operation and debuggability PoV is to crash with the clear `directory doesn't exist` error
in this case.
The remotewrite.Stop() expects that there are no pending calls to TryPush().
This means that the ingestionRateLimiter.Register() must be unblocked inside TryPush() when calling remotewrite.Stop().
Provide remotewrite.StopIngestionRateLimiter() function for unblocking the rate limiter before calling the remotewrite.Stop().
While at it, move the rate limiter into lib/ratelimiter package, since it has two users.
Also move the description of the feature to the correct place at docs/CHANGELOG.md.
Also cross-reference -remoteWrite.rateLimit and -maxIngestionRate command-line flags.
This is a follow-up for 02bccd1eb9
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5900
Custom HTTP headers are set via net/http.Header.Set or net/http.Header.Add functions.
These functions always convert header keys to canonical form. So the change at b577413d3b
isn't visible to users of VictoriaMetrics components.
There is no need in documenting this change at docs/CHANGELOG.md, since it doesn't give any useful information to users.
This is a follow-up for e6dd52b04c
* lib/storage: add ability to use downsampling for the given series filter
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* docs: add information about downsampling filters
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* docs: fix MetricsQL filter
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* lib/storage/downsampling: treat missing downsampling filter as a bug
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* lib/storage/part_header: verify correctness of downsampling filters when opening partition
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* lib/storage/downsampling: save only appliable rules in part metadata
Filter and save only rules which are appliable to partition based on MinTimestamp of stored data.
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* lib/storage/downsampling: update log messages for final dedup
Properly specify a reason of re-running deduplication for partition.
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* lib/storage: consistently use MaxTimestamp to determine deduplication/downsampling rules
Using MinTimestamp leads to applying downsampling to parts which are only partially covered by downsampling rule.
For example, partition covers range [1000-2000]. At t=2100 and rule offset 500 data with t=2100-500 => 1600 must be downsampled. The range check against MinTimestamp evaluates to true even though partition contains range which must not be downsampled - [1600:2000].
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* Follow-up
- Apply the first matching downsampling period if multiple filters match the given time series.
This allows fine-tuning the downsampling config for the specific needs.
- Take into account downsampling filters during search queries.
- Reduce the difference between community and enterprise branches. This should simplify further maintenance of these branches.
- Properly parse series filters with colons inside them.
- Document the feature at docs/CHANGELOG.md.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4960
---------
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* lib/storage: adds metrics for downsampling
vm_downsampling_partitions_scheduled - shows the number of parts, that must be downsampled
vm_downsampling_partitions_scheduled_size_bytes - shows total size in bytes for parts, the must be donwsampled
These two metrics answer the questions - is downsampling running? how many parts scheduled for downsampling and how many of them currently downsampled? Storage space that it occupies.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2612
* wip
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* rm recommendation to keep look-behind window empty, as it is not correct
* mention the change of default value for `-search.latencyOffset`
Signed-off-by: hagen1778 <roman@victoriametrics.com>
During shutdown period of vmalert, remotewrite client retrieve all pending time series from buffer queue, compose them into 1 batch and execute remote write.
This final batch may exceed the limit of -remoteWrite.maxBatchSize, and be rejected by the receiver (gateway, vmcluster or others).
This changes ensures that even during shutdown vmalert won't exceed the max batch size limit for remote write
destination.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/6025
To horizontally scale streaming aggregation, you might want to deploy a separate hashing tier
of vmagents that route to a separate aggregation tier. The hashing tier should shard by all labels
except the instance-level labels, to ensure the input metrics are routed correctly to the aggregator
instance responsible for those labels.
For this to achieve we introduce `remoteWrite.shardByURL.inverseLabels` flag to inverse logic of `remoteWrite.shardByURL.labels`
---------
Co-authored-by: Eugene Ma <eugene.ma@airbnb.com>
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
* vmalert: fix sending alert messages
1. fix `endsAt` field in messages that send to alertmanager, previously rule with small interval could never be triggered;
2. fix behavior of `-rule.resendDelay`, before it could prevent sending firing message when rule state is volatile.
* docs: update changelog notes
Signed-off-by: hagen1778 <roman@victoriametrics.com>
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: hagen1778 <roman@victoriametrics.com>
Store the deadline when the metricID entries must be deleted from indexdb
if metricID->metricName entry isn't found after the deadline. This should
make the code more clear comparing the the previous version, where the timestamp
of the first metricID->metricName lookup miss was stored in missingMetricIDs.
Remove the misleading comment about the importance of the order for creating entries
in the inverted index when registering new time series. The order doesn't matter,
since any subset of the created entries can become visible for search
before any other subset after registering in indexdb.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5948
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5959
* lib/storage/table: properly wait for force merges to be completed during shutdown
Properly keep track of running background merges and wait for merges completion when closing the table.
Previously, force merge was not in sync with overall storage shutdown which could lead to holding ptw ref.
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* docs: add changelog entry
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
---------
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
vmselect uses a cache folder in file system for two purposes:
1. Storing rollup cache results on shutdown;
2. Storing temporary search results from vmstorage during query executions.
It could happen that cache folder is deleted accidentally by user, or by OS
during cleanup routines. This would cause vmselect to:
1. panic on /metrics call, because `MustGetFreeSpace` will fail;
2. return query error user, as it won't be able to store temporary search results.
The changes in this commit are the following:
1. Make `MustGetFreeSpace` to try re-creating the cache folder if it is missing;
2. Make vmselect to try re-creating the cache folder if it can't persist tmp search
results.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5985
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: Nikolay <nik@victoriametrics.com>
* [vmagent] added ingestion rate limiting with new flag `-maxIngestionRate`. This flag can be used to limit the number of samples ingested by vmagent per second. If the limit is exceeded, the ingestion rate will be throttled.
* fix changelog
* fix review comment
When `--vm-native-step-interval` is specified, explore phase will be executed
within specified intervals. Discovered metric names will be associated with
time intervals at which they were discovered. This suppose to reduce number
of requests vmctl makes per metric name since it will skip time intervals
when metric name didn't exist.
This should also reduce probability of exceeding complexity limits
for number of selected series in one request during explore phase.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5369
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Links with upper-case simply don't work for unknown reason.
Once the reason is fixed on docs side, this commit can be reverted.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
It is confusing for cluster users to find that VMUI is available at vmselect as it is only mentioned in the list of URLs. Explicit mention of vmselect URL in docs will make it easier to discover.
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
This makes the code less fragile - it is harder to skip the convertToCompositeTagFilterss() call now.
While at it, call indexSearch.containsTimeRange() inside indexSearch.searchMetricIDsInternal()
in order to quickly terminate search of time series in the old indexdb for new time ranges.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5055
This is a follow-up for 2d31fd7855
It is assumed that URLPrefix.busOriginal will be initialized
durin Unmarshal of the config. But in tests we set fields manually,
so this field never get initialized properly.
Fixes the error `panic: runtime error: integer divide by zero`
at `vmauth.getLeastLoadedBackendURL`.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
If tlsServerName isn't empty, then it is likely the https request is sent to IP instead of hostname.
In this case the request will fail, since Go automatically sets the Host header to the IP instead
of the desired hostname at tlsServerName. So set the Host header to tlsServerName if itsn't empty.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5802
The `match[]` filter is mandatory at /api/v1/series, so it mustn't be dropped here.
There is no sense in dropping `match[]` filter together with `extra_label` and `extra_filters[]`
at /api/v1/labels and /api/v1/label/.../values if -search.ignoreExtraFiltersAtLabelsAPI commnad-line flag is set,
since:
- the `match[]` filter triggers slow path at these APIs;
- the `extra_label` and `extra_filters[]` filters narrow down the number of matched time series,
so they improve performance comparing to the case when only `match[]` filter is left,
while `extra_label` and `extra_filters[]` filters are dropped.
This is a follow-up for 0b7a23a91d
Improved trace display for better visual separation of branches:
* Increased left padding for each element
* Added padding for the last element in the branch
The change also removes misleading `default` value from README for `maxConcurrentInserts`
cmd-line flag.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
This reverts commit eb40395a1c.
Reason for revert: it has been appeared that the performance gain on multiple CPU cores
wasn't visible because the benchmark was generating incorrect pushSample.key.
See a207e0bf687d65f5198207477248d70c69284296
Previously samples were dropped on the first incomplete interval and the next complete interval.
Also make sure that the de-duplication is performed just before flushing the aggregate state.
This should help the case then dedup_interval = interval.
* use docs.victoriametrics.com instead of github docs
* add links to common terms used in VictoriaMetrics
Signed-off-by: hagen1778 <roman@victoriametrics.com>
For example, if `interval: 1m`, then data flush occurs at the end of every minute,
while `interval: 1h` leads to data flush at the end of every hour.
Add `no_align_flush_to_interval` option, which can be used for disabling the alignment.
The labelsMap struct employs the fact that label indexes are condensed around 0,
so it stores the referred labels in a slice instead of map and uses slice index as label key.
This allows increasing the LabelsCompressor.Decompress performance by up to 3x.
This also reduces the latency of data flush in stream aggregation.
- Reduce memory usage by up to 5x when de-duplicating samples across big number of time series.
- Reduce memory usage by up to 5x when aggregating across big number of output time series.
- Add lib/promutils.LabelsCompressor, which is going to be used by other VictoriaMetrics components
for reducing memory usage for marshaled []prompbmarshal.Label.
- Add `dedup_interval` option at aggregation config, which allows setting individual
deduplication intervals per each aggregation.
- Add `keep_metric_names` option at aggregation config, which allows keeping the original
metric names in the output samples.
- Add `unique_samples` output, which counts the number of unique sample values.
- Add `increase_prometheus` and `total_prometheus` outputs, which ignore the first sample
per each newly encountered time series.
- Use 64-bit hashes instead of marshaled labels as map keys when calculating `count_series` output.
This makes obsolete https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5579
- Expose various metrics, which may help debugging stream aggregation:
- vm_streamaggr_dedup_state_size_bytes - the size of data structures responsible for deduplication
- vm_streamaggr_dedup_state_items_count - the number of items in the deduplication data structures
- vm_streamaggr_labels_compressor_size_bytes - the size of labels compressor data structures
- vm_streamaggr_labels_compressor_items_count - the number of entries in the labels compressor
- vm_streamaggr_flush_duration_seconds - a histogram, which shows the duration of stream aggregation flushes
- vm_streamaggr_dedup_flush_duration_seconds - a histogram, which shows the duration of deduplication flushes
- vm_streamaggr_flush_timeouts_total - counter for timed out stream aggregation flushes,
which took longer than the configured interval
- vm_streamaggr_dedup_flush_timeouts_total - counter for timed out deduplication flushes,
which took longer than the configured dedup_interval
- Actualize docs/stream-aggregation.md
The memory usage reduction increases CPU usage during stream aggregation by up to 30%.
This commit is based on https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5850
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5898
The match[] at /api/v1/labels and /api/v1/label/.../values also may lead to slow requests and
high resource usage if it matches big number of time series. So it must be igrnored if -search.ignoreExtraFiltersAtLabelsAPI
command-line flag is set.
This is a follow-up for fab02faa3f
* app/{vmagent,vminsert}: adds /v1/metrics suffix for opentelemetry route path
it must fix compatibility with opentemetry-collector [spec](https://opentelemetry.io/docs/specs/otlp/\#otlphttp-request)
this suffix is hard-coded and cannot be changed with collector configuration
* Apply suggestions from code review
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* metricsql: fix label_join() when `dst_label` is equal to one of the `src_label`
* Update app/vmselect/promql/transform.go
* Update docs/CHANGELOG.md
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
- Document the ability to read OpenTelemetry data from Amazon Firehose at docs/CHANGELOG.md
- Simplify parsing Firehose data. There is no need in trying to optimize the parsing with fastjson
and byte slice tricks, since OpenTelemetry protocol is really slooow because of over-engineering.
It is better to write clear code for better maintanability in the future.
- Move Firehose parser from /lib/protoparser/firehose to lib/protoparser/opentelemetry/firehose,
since it is used only by opentelemetry parser.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5893
* deployment: create a separate env for VictoriaLogs
The new environment consists of the following components:
* VictoriaLogs
* fluentbit for collecting logs and sending to VictoriaLogs
* VictoriaMetrics for scraping and storing metrics from fluentbit and VictoriaLogs
* Grafana with VictoriaLogs datasource for monitoring
-----------------
The motivation for creating a separate environment is to simplify existing environments
and make it easier to update or modify them in future.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Prevsiously they were swapped - the first arg should be the label name and the second arg should be label filters
This is a follow-up for e389b7b959e8144fdff5075bf7a5a39b2b0c6dd3
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5847
This solves two issues:
1. The vm_backups_uploaded_bytes_total metric will grow more smoothly
2. This prevents from int overflow at metrics.Counter.Add() when uploading files bigger than 2GiB
This should simplify code maintenance by gradually converting to atomic.* types instead of calling atomic.* functions
on int and bool types.
See ea9e2b19a5
The issue has been introduced in bace9a2501
The improper fix was in the d4c0615dcd ,
since it fixed the issue just by an accident, because Go comiler aligned the rawRowsShards field
by 4-byte boundary inside partition struct.
The proper fix is to use atomic.Int64 field - this guarantees that the access to this field
won't result in unaligned 64-bit atomic operation. See https://github.com/golang/go/issues/50860
and https://github.com/golang/go/issues/19057
It has been appeared that there are VictoriaMetrics users, who rely on the fact that
VictoriaMetrics components were closing incoming connections to -httpListenAddr every 2 minutes
by default. So let's return back this value by default in order to fix the breaking change
made at d8c1db7953 .
See https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1304#issuecomment-1961891450 .
Previously the (date, metricID) entries for dates older than the last 2 days were removed.
This could lead to slow check for the (date, metricID) entry in the indexdb during ingesting historical data (aka backfilling).
The issue has been introduced in 431aa16c8d
This commit returns back limits for these endpoints, which have been removed at 5d66ee88bd ,
since it has been appeared that missing limits result in high CPU usage, while the introduced concurrency limiter
results in failed lightweight requests to these endpoints because of timeout when heavyweight requests are executed.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5055
* vmui: add a time picker to the "Logs Explorer" page #5673
* Update app/vmui/packages/vmui/src/pages/ExploreLogs/hooks/useFetchLogs.ts
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
Do not convert shard items to part when a shard becomes full. Instead, collect multiple
full shards and then convert them to a searchable part at once. This reduces
the number of searchable parts, which, in turn, should increase query performance,
since queries need to scan smaller number of parts.
* app/vmselect: adds milliseconds to the csv export response for rfc3339
* milliseconds is a standard prescion for VictoriaMetrics query request responses
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5837
* app/victoria-metrics: adds tests for csv export/import
follow-up after 3541a8d0cf96dd4f8563624c4aab6816615d0756
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: hagen1778 <roman@victoriametrics.com>
Previously the interval between item addition and its conversion to searchable in-memory part
could vary significantly because of too coarse per-second precision. Switch from fasttime.UnixTimestamp()
to time.Now().UnixMilli() for millisecond precision. It is OK to use time.Now() for tracking
the time when buffered items must be converted to searchable in-memory parts, since time.Now()
calls aren't located in hot paths.
Increase the flush interval for converting buffered samples to searchable in-memory parts
from one second to two seconds. This should reduce the number of blocks, which are needed
to be processed during high-frequency alerting queries. This, in turn, should reduce CPU usage.
While at it, hardcode the maximum size of rawRows shard to 8Mb, since this size gives the optimal
data ingestion pefromance according to load tests. This reduces memory usage and CPU usage on systems
with big amounts of RAM under high data ingestion rate.
The pooled rawRowsBlock objects occupies big amounts of memory between flushes,
and the flushes are relatively rare. So it is better to don't use the pool
and to allocate rawRow blocks on demand. This should reduce the average
memory usage between flushes.
The buffer can be quite big under high ingestion rate (e.g. more than 100MB).
This leads to increased memory usage between buffer flushes.
So it is better to re-create the buffer on every flush in order to reduce memory usage
between buffer flushes.
This should improve maintainability of the code related to rollup functions,
since it is located in rollup.go
While at it, properly return empty results from holt_winters(), rate_over_sum(),
sum2_over_time(), geomean_over_time() and distinct_over_time() when there are no real samples
on the selected lookbehind window. Previously the previous sample value was mistakenly
returned from these functions.
* [lib/promutils, lib/httputils] fixed floating-point error when parsing time in RFC3339 format (#5801)
* fixed tests
* fixed test
* Revert "fixed test"
This reverts commit 8a29764806.
* Revert "fixed tests"
This reverts commit 9ce13d1042.
* Revert "[lib/promutils, lib/httputils] fixed floating-point error when parsing time in RFC3339 format (#5801)"
This reverts commit a7a04bd4
* [lib/httputils] fixed floating-point error when parsing time in RFC3339 format (#5801)
---------
Co-authored-by: Nikolay <nik@victoriametrics.com>
Currently the docker-compose examples for loading `victoriametrics-datasource` uses 2 environment variables:
- `GF_ALLOW_LOADING_UNSIGNED_PLUGINS`
- `GF_DEFAULT_APP_MODE`
I believe both of the env vars are trying to achieve the same thing. `GF_DEFAULT_APP_MODE` disables code signing for all plugins and `GF_ALLOW_LOADING_UNSIGNED_PLUGINS` intends to disable code signing for just `victoriametrics-datasource`.
Keeping the scope narrowed to just `victoriametrics-datasource` would be preferable in this case.
Unfortunately `GF_ALLOW_LOADING_UNSIGNED_PLUGINS` is misspelled. According to [grafana docs](https://grafana.com/docs/grafana/latest/setup-grafana/configure-grafana/#override-configuration-with-environment-variables), the format is supposed to be `GF_<SectionName>_<KeyName>`. In other words the current env var is missing the section name.
This PR proposes to:
1. fix the typo
2. remove the global disablement of code signing
Alternatively, if you prefer to keep codesigning disabled globally, please remove `GF_ALLOW_LOADING_UNSIGNED_PLUGINS` env var as it confuses things
* support case-insensitive search
* reflect search condition in URL, so link can be sharable
* support filtering on /alerts page
* fix collapseAll/expandAll logic to respect only shown entries
* add changelog
b60dcbe11f
Signed-off-by: hagen1778 <roman@victoriametrics.com>
- Consistently return the first `limit` log entries if the total size of found log entries doesn't exceed 1Mb.
See app/vlselect/logsql/sort_writer.go . Previously random log entries could be returned with each request.
- Document the change at docs/VictoriaLogs/CHANGELOG.md
- Document the `limit` query arg at docs/VictoriaLogs/querying/README.md
- Make the change less intrusive.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5674
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5778
* - apply v1.10 changes
- chapter on model types (uni/multivariate and rolling)
* - update self-monitoring labels description
- fix typos
* fix duplicated text and link rendering
* app/vmauth: properly release memory during config reload
previously metrics package hold a refrence for channels for users concurrent requests.
it case of churn at `name` field of users configuration, new metric was created. But previous one wasn't deleted.
It prevented full parsed configuration from being garbace collected.
now all config related metrics are bound to corresponding metrics.Set and unregistered during config reload process.
It also must fix an issue with incorrect values for current concurrent user requests
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4690
* wip
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
Instead, log a sample of these long items once per 5 seconds into error log,
so users could notice and fix the issue with too long labels or too many labels.
Previously this panic could occur in production when ingesting samples with too long labels.
The 3c246cdf00 added an optimization where the previous metaindexRow
could be saved to disk when the current block header couldn't be added indexBlock because the resulting
indexBlock size became too big. This could result in an empty metaindexRow.firstItem for the next metaindexRow.
This allows removing importing unneeded command-line flags into binaries, which import lib/storage,
which, in turn, was importing lib/snapshot in order to use Time, Validate and NewName functions.
This is a follow-up for 83e55456e2
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5738
Closing client connections every 2 minutes doesn't help load balancing -
this just leads to "jumpy" connections between multiple backend servers,
e.g. the load isn't spread evenly among backend servers, and instead jumps
between the servers every 2 minutes.
It is still possible periodically closing client connections by specifying non-zero -http.connTimeout command-line flag.
This should help with https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1304#issuecomment-1636997037
This is a follow-up for d387da142e
Sync description after changes in 83e55456e2
Remove extra text in flags description for vmbackupmanager as it breaks layout
when rendered at docs.victoriametrics.com
Signed-off-by: hagen1778 <roman@victoriametrics.com>
There is no sense in storing commonPrefix for blockHeader containing only a single item,
since this only increases blockHeader size without any benefits.
This panic could occur when samples with too long label values are ingested into VictoriaMetrics.
This could result in too long fistItem and commonPrefix values at blockHeader (up to 64kb each).
This may inflate the maximum index block size by 4 * maxIndexBlockSize.
* add more details to changelog
* simplify panels description
* remove capacity planning recommendation, as it proves it incompetent
Signed-off-by: hagen1778 <roman@victoriametrics.com>
During background downsampling, rate(vm_deduplicated_samples_total{type="merge"}) could be much bigger than
rate(vm_rows_added_to_storage_total) and it could last quite some time,
which causes negative values of Storage full ETA and confuses users, see playground.
Instead of trying to get more accurate results during downsampling, I think it's ok to ignore
vm_deduplicated_samples_total at all, it's more reasonable to see Storage full ETA increase after downsampling.
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
changes(), increases_over_time() and resets() shouldn't take into account
value changes, which may occur because of precision errors.
The maximum guaranteed precision for raw samples stored in VictoriaMetrics is 12 decimal digits.
So do not count relative changes for values if they are smaller than 1e-12 comparing to the value.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/767
For example, -fooDuration=',10s,' is now supported - it sets three command-line flag values:
- the first and the last one are set to the default value for `-fooDuration`
- the second one is set to 10s
* vmui: fix select closing on click outside (#5728)
* vmui: clear entered text in select after selecting a value (#5727)
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
This should significantly reduce the number of open ReaderAt files
on VictoriaMetrics and VictoriaLogs startup.
The open files can be tracked via vm_fs_readers metric
GOGC can be already set via environment variable. There is no need in adding
new approaches for setting the GOGC (such as command-line flag), since they complicate operations.
It should help identifying cases when too much CPU is spent on garbage collection,
and advice users on how this can be addressed.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Remove temporary file before closing it in order to signal the OS that it shouldn't
store the file contents from page cache to disk when the file is closed.
Gracefully handle the case when the file cannot be removed before being closed -
in this case remove the file after closing it. This allows working on Windows.
Also remove superflouos opening of temporary file for reading - re-use already opened file handle for writing.
This is a follow-up for 9b1e002287
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/4020
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/70
The easyproto-based marshaler is 2x slower than the previous custom marshaler,
so let's stick with it. This improves the performance for sending data to remote storage at vmagent
and reduces CPU usage to pre-v1.97.0 levels.
This case is possible after a new brsPool is allocated. The fix is to verify whether len(brsPool) >= len(brs.brs)
before trying to append a new item to brsPool and sharing its contents with brs.brs.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5733
* app/vmselect: set proper timestamp for cached instant responses
The change updates `getSumInstantValues` to prefer timestamp
from the most recent results. Before, timestamp from cached series
was used.
The old behavior had negative impact on recording rules as they
were getting responses with shifted timestamps in past.
Subsequent recording or alerting rules fetching results of these
recording rules could get no result due to staleness interval.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5659
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* wip
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
There is no sense in running more than GOMAXPROCS concurrent marshalers,
since they are CPU-bound. More concurrent marshalers do not increase the marshaling bandwidth,
but they may result in more RAM usage.
Entries for the previous dates is usually not used, so there is little sense in keeping them in memory.
This should reduce the size of storage/date_metricID cache, which can be monitored
via vm_cache_entries{type="storage/date_metricID"} metric.
This limit has little sense for these APIs, since:
- Thses APIs frequently result in scanning of all the time series on the given time range.
For example, if extra_filters={datacenter="some_dc"} .
- Users expect these APIs shouldn't hit the -search.maxUniqueTimeseries limit,
which is intended for limiting resource usage at /api/v1/query and /api/v1/query_range requests.
Also limit the concurrency for /api/v1/labels, /api/v1/label/.../values
and /api/v1/series requests in order to limit the maximum memory usage and CPU usage for these API.
This limit shouldn't affect typical use cases for these APIs:
- Grafana dashboard load when dashboard labels should be loaded
- Auto-suggestion list load when editing the query in Grafana or vmui
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5055
* port un-synced changed from docs/readme to readme
* consistently use `sh` instead of `console` highlight, as it looks like
a more appropriate syntax highlight
* consistently use `sh` instead of `bash`, as it is shorter
* consistently use `yaml` instead of `yml`
See syntax codes here https://gohugo.io/content-management/syntax-highlighting/
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* Optimize the performance of data merge: decimal.CalibrateScale() from 49633 ns/op to 9146 ns/op
* Optimize the performance of data merge: decimal.CalibrateScale()
Sending unfinished aggregate states tend to produce unexpected anomalies with lower values than expected.
The old behavior can be restored by specifying `flush_on_shutdown: true` setting in streaming aggregation config
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* app/vmalert: fix data race during hot-config reload
During hot-reload, the logic evokes the group update and rules evaluation
interruption simultaneously. Falsely assuming that interruption happens before
the update. However, it could happen that group will be updated first and only
after the rules evaluation will be cancelled. Which will result in permanent
interruption for all rules within the group.
The fix caches the cancel context function into local variable first. And only after
performs the group update. With cached cancel function we can safely call it without
worrying that we cancel the evaluation for already updated group.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* Revert "app/vmalert: fix data race during hot-config reload"
This reverts commit a4bb7e8932.
* app/vmalert: fix data race during hot-config reload
During hot-reload, the logic evokes the group update and rules evaluation
interruption simultaneously. Falsely assuming that interruption happens before
the update. However, it could happen that group will be updated first and only
after the rules evaluation will be cancelled. Which will result in permanent
interruption for all rules within the group.
The fix cancels the evaulation context before applying the update, making sure
that the context will be cancelled for old group always.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* wip
Signed-off-by: hagen1778 <roman@victoriametrics.com>
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
- Maintain a separate worker pool per each part type (in-memory, file, big and small).
Previously a shared pool was used for merging all the part types.
A single merge worker could merge parts with mixed types at once. For example,
it could merge simultaneously an in-memory part plus a big file part.
Such a merge could take hours for big file part. During the duration of this merge
the in-memory part was pinned in memory and couldn't be persisted to disk
under the configured -inmemoryDataFlushInterval .
Another common issue, which could happen when parts with mixed types are merged,
is uncontrolled growth of in-memory parts or small parts when all the merge workers
were busy with merging big files. Such growth could lead to significant performance
degradataion for queries, since every query needs to check ever growing list of parts.
This could also slow down the registration of new time series, since VictoriaMetrics
searches for the internal series_id in the indexdb for every new time series.
The third issue is graceful shutdown duration, which could be very long when a background
merge is running on in-memory parts plus big file parts. This merge couldn't be interrupted,
since it merges in-memory parts.
A separate pool of merge workers per every part type elegantly resolves both issues:
- In-memory parts are merged to file-based parts in a timely manner, since the maximum
size of in-memory parts is limited.
- Long-running merges for big parts do not block merges for in-memory parts and small parts.
- Graceful shutdown duration is now limited by the time needed for flushing in-memory parts to files.
Merging for file parts is instantly canceled on graceful shutdown now.
- Deprecate -smallMergeConcurrency command-line flag, since the new background merge algorithm
should automatically self-tune according to the number of available CPU cores.
- Deprecate -finalMergeDelay command-line flag, since it wasn't working correctly.
It is better to run forced merge when needed - https://docs.victoriametrics.com/#forced-merge
- Tune the number of shards for pending rows and items before the data goes to in-memory parts
and becomes visible for search. This improves the maximum data ingestion rate and the maximum rate
for registration of new time series. This should reduce the duration of data ingestion slowdown
in VictoriaMetrics cluster on e.g. re-routing events, when some of vmstorage nodes become temporarily
unavailable.
- Prevent from possible "sync: WaitGroup misuse" panic on graceful shutdown.
This is a follow-up for fa566c68a6 .
Thanks @misutoth to for the inspiration at https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5212
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5190
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3790
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3551
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3337
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3425
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3647
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3641
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/648
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/291
* docs: remove raw and endraw tags as they are not needed for the new version of site
Signed-off-by: Artem Navoiev <tenmozes@gmail.com>
* revert formating in vmaler
Signed-off-by: Artem Navoiev <tenmozes@gmail.com>
---------
Signed-off-by: Artem Navoiev <tenmozes@gmail.com>
The maxFileParts usage has been accidentally removed in fa566c68a6
While at it, add Count suffix to *AssistedMerges counter names in order to make them less misleading.
Previously their names were falsely suggesting that these are gauges, which show the number of concurrently
executed assisted merges.
This reverts cd4f641d32 , since it has been appeared that the disabled compression
for vmstorage->vmselect data increase network bandwidth usage by more than 10x on typical production workloads,
while it decreases CPU usage at vmstorage by up to 10% and improves query latency by up to 10%.
The 10x increase in network usage is too high price for 10% improvements on query latency and vmstorage CPU usage.
This may result in network bandwidth bottlenecks, which can reduce the overall performance and stability
of VictoriaMetrics cluster. That's why return back the vmstorage->vmselect data compression by default.
The vmstorage->vmselect compression can be disabled by passing -rpc.disableCompression command-line flag to vmstorage.
The vmselect->vmselect compression in multi-level cluster setup can be disabled by passing -clusternative.disableCompression command-line flag.
It has been appeared that the registration of new time series slows down linearly
with the number of indexdb parts, since VictoriaMetrics needs to check every indexdb part
when it searches for TSID by newly ingested metric name.
The number of in-memory parts grows when new time series are registered
at high rate. The number of in-memory parts grows faster on systems with big number
of CPU cores, because the mergeset maintains per-CPU buffers with newly added entries
for the indexdb, and every such entry is transformed eventually into a separate in-memory part.
The solution has been suggested in https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5212
by @misutoth - to limit the number of in-memory parts with buffered channel.
This solution is implemented in this commit. Additionally, this commit merges per-CPU parts
into a single part before adding it to the list of in-memory parts. This reduces CPU load
when searching for TSID by newly ingested metric name.
The https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5212 recommends setting the limit on the number
of in-memory parts to 100, but my internal testing shows that much lower limit 15 works with the same efficiency
on a system with 16 CPU cores while reducing memory usage for `indexdb/dataBlocks` cache by up to 50%.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5190
This allows reducing the indexdb/tagFiltersToMetricIDs cache size by 8 on average.
The cache size can be checked via vm_cache_size_bytes{type="indexdb/tagFiltersToMetricIDs"} metric exposed at /metrics page.
Measuring read / write duration from / to in-memory buffers has little sense,
since it will be always fast. It is better to measure read / write duration from / to
real files at vm_filestream_write_duration_seconds_total and vm_filestream_read_duration_seconds_total metrics.
This also reduces overhead on time.Now() and Histogram.UpdateDuration() calls
per each filestream.Reader.Read() and filestream.Writer.Write() call when the data is read / written from / to in-memory buffers.
This is a follow-up for 2f63dec2e3
* lib/promscrape: respect `0` value for `series_limit` param
Respect `0` value for `series_limit` param in `scrape_config`
even if global limit was set via `-promscrape.seriesLimitPerTarget`.
Previously, `0` value will be ignored in favor of `-promscrape.seriesLimitPerTarget`.
This behavior aligns with possibility to override `series_limit` value via
relabeling with `__series_limit__` label.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* Update docs/CHANGELOG.md
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
Previously, it was not possible to determine which tenant sends metrics with excessive amount of labels of label values.
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* vmui: fix the logic of closing the popper #5470
* vmui: fix the logic of caching autocomplete results #5472
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
This reduces the number of memory allocations at the cost of possible memory usage increase,
since now different metric name strings may hold references to the previous byte slice.
This is good tradeoff, since ProcessSearchQuery is called in vmselect, and vmselect isn't usually limited by memory.
This change has been extracted from https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5527
This should smooth CPU and RAM usage spikes related to these periodic tasks,
by reducing the probability that multiple concurrent periodic tasks are performed at the same time.
The dateMetricIDCache puts recently registered (date, metricID) entries into mutable cache protected by the mutex.
The dateMetricIDCache.Has() checks for the entry in the mutable cache when it isn't found in the immutable cache.
Access to the mutable cache is protected by the mutex. This means this access is slow on systems with many CPU cores.
The mutabe cache was merged into immutable cache every 10 seconds in order to avoid slow access to mutable cache.
This means that ingestion of new time series to VictoriaMetrics could result in significant slowdown for up to 10 seconds
because of bottleneck at the mutex.
Fix this by merging the mutable cache into immutable cache after len(cacheItems) / 2
cache hits under the mutex, e.g. when the entry is found in the mutable cache.
This should automatically adjust intervals between merges depending on the addition rate
for new time series (aka churn rate):
- The interval will be much smaller than 10 seconds under high churn rate.
This should reduce the mutex contention for mutable cache.
- The interval will be bigger than 10 seconds under low churn rate.
This should reduce the uneeded work on merging of mutable cache into immutable cache.
The change removes artificial delay before returning error, which sometimes
caused less retry events than expected.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* lib/promscrape/discovery/kubernetes: fix watcher start order for roles endpoints and endpointslice
Previously the groupWatcher could be mistakenly stopped when requests for pod or services resources take too long.
* remove mislead comment
* docs/sd_configs.md: mention -promscrape.kubernetes.attachNodeMetadataAll flag in the description for attach_metadata section
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4640
* wip
* lib/promscrape/kubernetes: prevent from stopping groupWatcher when there are in-flight apiWatcher.mustStart() calls
groupWatcher is stopped if it has zero registered apiWatchers during 14 seconds.
But such a groupWatcher can be still in use if apiWatcher for `role: endpoints` or `role: endpointslice`
is being registered and the discovery of the associated `pod` and/or `service` objects takes longer
than 14 seconds - see the beginning of groupWatcher.startWatchersForRole() function for details.
Track the number of in-flight calls to apiWatcher.mustStart() and prevent from stopping the associated groupWatcher
if the number of in-flight calls is non-zero.
P.S. postponing the discovery of `pod` and/or `service` objects associated with `endpoints` or `endpointslice` roles
isn't the best solution, since it slows down initial discovery of `endpoints` and `endpointslice` targets.
* typo fix
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
Examples:
1) -metricsAuthKey=file:///abs/path/to/file - reads flag value from the given absolute filepath
2) -metricsAuthKey=file://./relative/path/to/file - reads flag value from the given relative filepath
3) -metricsAuthKey=http://some-host/some/path?query_arg=abc - reads flag value from the given url
The flag value is automatically updated when the file contents changes.
* app/vmauth: adds metric_labels and backend_errors counter
it must improve observability for user requests with new metric - per user backend errors counter.
it's needed to calculate requests fail rate to the configured backends.
metric_labels configuration allows to perform additional aggregations on top of multiple users from configuration section.
It could be multiple clients or clients with separate read/write tokens
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5565
* wip
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* app/vmselect/promql: properly handle possible negative results caused by float operations precision error in rollup functions like rate() or increase()
* fix test
- docs/sd_configs.md: moved hetzner_sd_configs docs to the correct place according to alphabetical order of SD names,
document missing __meta_hetzner_role label.
- lib/promscrape/config.go: added missing MustStop() call for Hetzner SD,
and moved the code to the correct place according to alphabetical order of SD names.
- lib/promscrape/discovery/hetzner: properly handle pagination for hloud API responses,
populate missing __meta_hetzner_role label like Prometheus does.
- Properly populate __meta_hetzner_public_ipv6_network label like Prometheus does.
- Remove unused SDConfig.Token.
- Remove "omitempty" annotation from SDConfig.Role field, since this field is mandatory.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5550
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3154
It is likely this value was borrowed from panel `Slow inserts` panel
from Grafana dasbhoard for VM single/cluster installations. This is a mistake.
There is no such thing as "tolerable churn rate", as tolerancy depends on the amount
of allocated resources.
Although, it is unclear what is meant by 5%. If it refers to 5% of new time series per second,
then such churn rate is extremely high. It would mean that the avg life of a time series is 20s.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
app/victoriametrics: update the test suite
* simplify /query and /query_range test cases configuration and tests
* support instant queries with lookbehind window like `query=foo[5m]`
* support instant queries selecting scalar value like `query=42`
* add query_range test for prometheus
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmui/vmanomaly: add support models that produce only `anomaly_score`
* vmui/vmanomaly: fix display legend
* vmui/vmanomaly: update comment on anomaly threshold
- Clarify the bugfix description at docs/CHANGELOG.md
- Simplify the code by accessing prefetchedMetricIDs struct under the lock
instead of using lockless access to immutable struct.
This shouldn't worsen code scalability too much on busy systems with many CPU cores,
since the code executed under the lock is quite small and fast.
This allows removing cloning of prefetchedMetricIDs struct every time
new metric names are pre-fetched. This should reduce load on Go GC,
since the cloning of uin64set.Set struct allocates many new objects.
Add docs to explain that `latest` backup folder can be accessed
more frequently than others. Hence, it is recommended to keep it
on storages with low access price.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Before, this cache was limited only by size.
Cache invalidation by time happens with jitter to prevent thundering herd problem.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
It was calculating the number of dropped time series instead of the number of dropped samples.
While at it, drop vmalert_remotewrite_dropped_bytes_total metric, since it was inconsistently calculated -
at one place it was calculating raw protobuf-encoded sample sizes, while at another place it was calculating
the size of snappy-compressed prompbmarshal.WriteRequest protobuf message.
Additionally, this metric has zero practical sense, so just drop it in order to reduce the level of confusion.
* update link to book a demo for AD & RCA
* fix invalid refs in components/model
* - fix staging -> prod links
- replace capitalized FAQ headers
- change the section order on main page
- replace :latest tag with current stable
* add panels for detailed visualization of traffic usage between vmstorage, vminsert, vmselect
components and their clients. New panels are available in the rows dedicated to specific components.
* update "Slow Queries" panel to show percentage of the slow queries to the total number of read queries
served by vmselect. The percentage value should make it more clear for users whether there is a service degradation.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* app/vmselect: drop `rollupDefault` function as duplicate
It is unclear why there are two identical fns `rollupDefault`
and `rollupDistinct`. Dropping one of them.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* Update app/vmselect/promql/rollup.go
* Update app/vmselect/promql/rollup.go
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
The user may which to control the endpoint parameters for instance to
set the audience when requesting an access token. Exposing the
parameters as a map allows for additional use cases without requiring
modification.
* lib/awsapi: properly assume role with webIdentity token
introduce new irsaRoleArn param for config. It's only needed for authorization with webIdentity token.
First credentials obtained with irsa role and the next sts assume call for an actual roleArn made with those credentials.
Common use case for it - cross AWS accounts authorization
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3822
* wip
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
Some clients may ingest samples via OpenTelemetry protocol without Resource labels.
Previously VictoriaMetrics was silently dropping such samples.
The commit 317834f876 added vm_protoparser_rows_dropped_total{type="opentelemetry",reason="resource_not_set"}
counter for tracking of such dropped samples. See https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5459
It is better from usability PoV to accept such samples instead of dropping them and incrementing the corresponding counter.
Before, retries happened only on writes into a network connection
between source and destination. But errors returned by server after
all the data was transmitted were logged, but not retried.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Currently, it is impossible to understand why metrics are not ingested when resource is not set by OTEL exporter. Adding metric should simplify debugging and make it improve debuggability.
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
app/vmselect/promql/eval.go:evalAggrFunc shunts evaluation
of AggrFuncExpr over rollupFunc over MetricsExpr to an optimized
path. tryGetArgRollupFuncWithMetricExpr() checks whether expression
can be shunted, but it mangles the AggrFuncExpr when the aggregation
function has more than one argument. This results in queries like
`sum(aggr_over_time("avg_over_time",m))` failing with error message
'expecting at least 2 args to "aggr_over_time"; got 1 args' while
the analogous query `sum(avg_over_time(m))` executes successfully.
This fix removes the unnecessary mangling.
Signed-off-by: Anton Tykhyy <atykhyy@gmail.com>
Previously `retry_status_codes: []` and `drop_src_path_prefix_parts: 0` at `url_map` were equivalent to missing values.
This was resulting in using the user-level values instead.
* vmui: add show quick tip for autocomplete
* vmui: auto-completion usability improvements #5348
* vmui: add const for min symbols in autocomplete
* Use proper queries to VictoriaMetrics
* vmui: fix comments for autocomplete
* app/vmselect: run `make vmui-update`
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* fix typos in docs
* add `shard-` prefix to generated links when `-promscrape.cluster.memberURLTemplate` is enabled
Signed-off-by: hagen1778 <roman@victoriametrics.com>
This is follow-up after
75196d7234
It updates some of the alerting rules to remove unnecessary aggregations.
It keeps aggregations for expressions which are using multiple time series
filters to make sure their label will match.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Aggregations with by() have one sideeffect, that any custom labels you add to hosts are dropped too which can be used for alerts routing.
Therefore, some good practice could be to use without() instead, with labels, like without(path) , or without(url) to get same aggregations but with any external labels left intact.
Before, vmalert would send notifications with labels containing characters
not supported by Alertmanager validator, resulting into validation errors
like `msg="Failed to validate alerts" err="invalid label set: invalid name "foo.bar"`
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Previously the lower bound could be too small, which could result in missing values at the beginning of the graph
for default_rollup() function. This function is automatically applied to all the series selectors if they aren't
explicitly wrapped into a rollup function - see https://docs.victoriametrics.com/MetricsQL.html#implicit-query-conversions
While at it, properly take into account `-search.minStalenessInterval` command-line flag when adjusting
the lower bound for the selected time range.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5388
* vmagent: export `vm_promscrape_scrape_pool_targets` metric to track the number of targets that each scrape_job discovers
* add extra panel for new metric
- Add links to relevant docs into descriptions for every -kafka.* and -gcp.pubsub.* command-line flags.
- Wait until message processing goroutines are stopped before returning from gcppubsub.Stop().
- Prevent from multiple calls to Init() without Stop().
- Drop message if tenantID cannot be parsed properly.
- Take into account tenantID for all the supported message formats.
- Support gzip-compressed messages for graphite format.
- Use exponential backoff sleep when the message cannot be pushed to remote storage systems
because of disabled on-disk persistence - https://docs.victoriametrics.com/vmagent.html#disabling-on-disk-persistence
- Unblock from sleep as soon as Stop() is called. Previously the sleep could take up to 2 seconds after Stop() is called.
- Remove unused globalCtx and initContext from app/vmagent/remotewrite/gcppubsub
- Mention Google PubSub support at docs/enterprise.md
- Make Google PubSub docs more clear at docs/vmagent.md
This is a follow-up for commits 115245924a5f096c5a3383d6cc8e8b6fbd421984
and e6eab781ce42285a6a1750dc01eba6801dd35516 .
Updates https://github.com/VictoriaMetrics/VictoriaMetrics-enterprise/pull/717
Updates https://github.com/VictoriaMetrics/VictoriaMetrics-enterprise/pull/713
* app/vmalert: expose `/vmalert/api/v1/rule` and `/api/v1/rule` API which returns rule status in JSON format
* app/vmalert: hide updates if query param not set
* app/vmalert: fix panic (recursion call)
* app/vmalert: add needed group name and file name
* app/vmalert: fix comment, update behavior
* app/vmalert: fix description
* app/vmalert: simplify API for /api/v1/rule
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* app/vmalert: simplify API for /api/v1/rule
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* app/vmalert: simplify API for /api/v1/rule
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* app/vmalert: simplify API for /api/v1/rule
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* app/vmalert: simplify API for /api/v1/rule
Signed-off-by: hagen1778 <roman@victoriametrics.com>
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
Co-authored-by: hagen1778 <roman@victoriametrics.com>
E.g. replace `fs.Dir + filePath` with `path.Join(fs.Dir, filePath)`
The fs.Dir is guaranteed to end with slash - see Init() functions.
The filePath may start with slash. If it starts with slash, then `fs.Dir + filePath` constructs
an incorrect path with double slashes.
path.Join() properly substitutes duplicate slashes with a single slash in this case.
While at it, also substitute incorrect usage of filepath.Join() with path.Join()
for constructing paths to object storage systems, which expect forward slashes in paths.
filepath.Join() substittues forward slashes with backslashes on Windows, so this may break
creating or managing backups from Windows.
This is a follow-up for 0399367be602b577baf6a872ca81bf0f99ba401b
Updates https://github.com/VictoriaMetrics/VictoriaMetrics-enterprise/pull/719
* lib/backup/s3remote: remove prev object versions for recursive delete
- fix error caused by sending empty objects list to be deleted. This was possible in case old versions of objects where deleted, but root-level entries where still available. This caused paginator to return an empty page which wasn't skipped.
- delete previous versions of objects recursively for S3 remote
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* docs/changelog: add vmbackupmanager fix entry
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* lib/backup/s3remote: unify path construction for S3 objects
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
---------
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
Previously concurrency for static and fast queries was limited with the -search.maxConcurrentRequests
command-line flag. This could complicate identifying heavy queries via `vmui` at `Top queries` and `Active queries` pages,
since `vmui` and these pages couldn't be opened on overloaded vmselect.
Thanks to @f41gh7 for the idea.
Previously the /service-discovery page didn't show targets dropped because of sharding
( https://docs.victoriametrics.com/vmagent.html#scraping-big-number-of-targets ).
Show also the reason why every target is dropped at /service-discovery page.
This should improve debuging why particular targets are dropped.
While at it, do not remove dropped targets from the list at /service-discovery page
until the total number of targets exceeds the limit passed to -promscrape.maxDroppedTargets .
Previously the list was cleaned up every 10 minutes from the entries, which weren't updated
for the last minute. This could complicate debugging of dropped targets.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5389
* lib/streamaggr: properly reference slice with labels
by limiting slice capacity. It must fix issues with slice modification, in case of append new slice will be allocated, instead of modifying refrenced slice
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5402
* Reduce memory allocations when output_relabel_configs adds new labels to output samples
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* prevent /api/v1 from panic on parsing rows
* add tests for Extract function for v1 and v2 api's
* separate request types in different pools to prevent different objects mixing
* add changelog line
543f218fe9
Signed-off-by: hagen1778 <roman@victoriametrics.com>
- Add Try* prefix to functions, which return bool result in order to improve readability and reduce the probability of missing check
for the result returned from these functions.
- Call the adjustSampleValues() only once on input samples. Previously it was called on every attempt to flush data to peristent queue.
- Properly restore the initial state of WriteRequest passed to tryPushWriteRequest() before returning from this function
after unsuccessful push to persistent queue. Previously a part of WriteRequest samples may be lost in such case.
- Add -remoteWrite.dropSamplesOnOverload command-line flag, which can be used for dropping incoming samples instead
of returning 429 Too Many Requests error to the client when -remoteWrite.disableOnDiskQueue is set and the remote storage
cannot keep up with the data ingestion rate.
- Add vmagent_remotewrite_samples_dropped_total metric, which counts the number of dropped samples.
- Add vmagent_remotewrite_push_failures_total metric, which counts the number of unsuccessful attempts to push
data to persistent queue when -remoteWrite.disableOnDiskQueue is set.
- Remove vmagent_remotewrite_aggregation_metrics_dropped_total and vm_promscrape_push_samples_dropped_total metrics,
because they are replaced with vmagent_remotewrite_samples_dropped_total metric.
- Update 'Disabling on-disk persistence' docs at docs/vmagent.md
- Update stale comments in the code
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5088
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2110
* app/vmagent: allow to disabled on-disk queue
Previously, it wasn't possible to build data processing pipeline with a
chain of vmagents. In case when remoteWrite for the last vmagent in the
chain wasn't accessible, it persisted data only when it has enough disk
capacity. If disk queue is full, it started to silently drop ingested
metrics.
New flags allows to disable on-disk persistent and immediatly return an
error if remoteWrite is not accessible anymore. It blocks any writes and
notify client, that data ingestion isn't possible.
Main use case for this feature - use external queue such as kafka for
data persistence.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2110
* adds test, updates readme
* apply review suggestions
* update docs for vmagent
* makes linter happy
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
Tests showed that importing a single line with 70MB size takes 5.3GiB
RSS memory for VictoriaMetrics single-node.
In the scenario when user exports and imports data from one VM to another,
it could possibly lead to OOM exception for destination VM.
Importing a single line with 16MB size taks 1.3GiB RSS memory.
Hence, the limit for `import.maxLineLen` was decreased from 100MB to 10MB
to improve reliability of VictoriaMetrics during imports.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
The default caching for Go artifacts from actions/setup-go@v4 uses the hash of go.sum file
as a cache key - see https://github.com/actions/cache/blob/main/examples.md#go---modules .
This isn't enough for VictoriaMetrics case, since different makefile actions build different the Go artifacts,
which need to be cached. So embed the action name in the cache key.
This case is possible after the following steps:
1. vmagent successfully performed handshake with the -remoteWrite.url and the remote storage supports zstd-compressed data.
2. remote storage became unavailable or slow to ingest data, vmagent compressed the collected data into blocks with zstd and puts these blocks to persistent queue on disk.
3. vmagent restarts and the remote storage is unavailable during the handshake, then vmagent falls back to Snappy compression.
4. vmagent starts sending zstd-compressed data from persistent queue to the remote storage, while falsely advertizing it sends Snappy-compressed data.
5. The remote storage receives zstd-compressed data and fails unpacking it with Snappy.
The solution is the same as 12cd32fd75, just fall back to zstd decompression if Snappy decompression fails.
Previously the number of memory allocations inside copyTimeseriesShallow() was equal to 1+len(tss)
Reduce this number to 2 by pre-allocating a slice of timeseries structs with len(tss) length.
The `path!="/favicon.ico"` filter has little sense, since there are many other special paths,
which may be filtered out - /metrics, /flags, /health, /ping, /robots.txt, /-/healthy, /-/ready, /reload, etc.
See /lib/httpserver/httpserver.go for more details.
It will be hard or impossible to maintain filters for all these paths, so it is better to drop this filter
in order to simplify queries and improve the consistency of these queries.
The buffered connection could have exceeded the underlying connection
deadline during reading or writing to an internal buffer.
With this change, buffered connection struct additionally checks
for a deadline in Read/Write methods.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
evalRollupFuncNoCache() may return time series with identical labels (aka duplicate series)
when performing queries satisfying all the following conditions:
- It must select time series with multiple metric names. For example, {__name__=~"foo|bar"}
- The series selector must be wrapped into rollup function, which drops metric names. For example, rate({__name__=~"foo|bar"})
- The rollup function must be wrapped into aggregate function, which has no streaming optimization.
For example, quantile(0.9, rate({__name__=~"foo|bar"})
In this case VictoriaMetrics shouldn't return `cannot merge series: duplicate series found` error.
Instead, it should fall back to query execution with disabled cache.
Also properly store the merged results. Previously they were incorrectly stored because of a typo
introduced in the commit 41a0fdaf39
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5332
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5337
`version` label won't show the difference if various flavors of the same
version were deployed. But `short_version` will.
For example, on the sandbox env we test VM builds before new version release.
Without this change, the version update won't be visible on dashboard.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* lib/querytracer: makes package concurrent safe to use
it must fix various issues with concurrent code usage.
Especially, when it's not reasonable to wait for all goroutines to be finished
* wip
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
- If min_over_time(m[offset] @ timestamp) <= min_over_time(m[offset] @ (timestamp-window)),
then the optimization can be applied.
- If max_over_time(m[offset] @ timestamp) >= max_over_time(m[offset] @ (timestamp-window)),
then the optimization can be applied.
* vmui: reduced the number of server requests
* run `make vmui-update vmui-logs-update`
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
This case is possible after the following steps:
1. vmagent tries to perform handshake with the -remoteWrite.url in order to determine whether
the remote storage supports zstd-compressed data.
2. The remote storage is unavailable during the handshake. In this case vmagent falls back to Snappy compression
for the data sent to the remote storage.
3. vmagent compresses the collected data into blocks with Snappy and puts these blocks to persistent queue on disk.
4. The remote storage becomes available.
5. vmagent restarts, performs the handshake with the remote storage and detects that it supports zstd-compressed data.
6. vmagent starts sending Snappy-compressed data from persistent queue to the remote storage,
while falsely advertizing it sends zstd-compressed data.
7. The remote storage receives Snappy-compressed data and fails unpacking it with zstd.
The solution is to just fall back to Snappy decompression if zstd decompression fails.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5301
We have failed test on master branch.
```
--- FAIL: TestFormatLogMessage (0.00s)
logger_test.go:24: unexpected result; got
"foo: abcde, \"foo bar baz\", xx"
want
"foo: a..e, \"f..z\", xx"
```
if failed because maxArgs maxLen <= 4 in the `LimitStringLen` in that case we always will return the income string
but in the test we limit the maxLen by value 4
```
f("foo: %s, %q, %s", []interface{}{"abcde", fmt.Errorf("foo bar baz"), "xx"}, 4, `foo: a..e, "f..z", xx`)
Previously the `Host` header was remained unchanged when passing it in requests to backends.
This may improperly work if the backend uses host-based routing.
While at it, allows http/2.0 requests to backends. While VictoriaMetrics components
do not accept http/2.0 requests, other backends can require such requests.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5240
- Re-use identically configured http.Transport across multiple users.
This fixes handling of the limit on the number of connection, which can be established per each backend
via -maxIdleConnsPerBackend command-line flag. This limit stopped working after 323f3720ed
- Add docs about backend TLS setup at https://docs.victoriametrics.com/vmauth.html#backend-tls-setup
- Add ability to disable backend TLS verification for all the users via -backend.tlsInsecureSkipVerify command-line flag.
This flag may be useful when -auth.config contains big number of users, and every user must disable backend TLS verification.
- Add ability to specify TLS Root CA via tls_ca_file option at per-user basis and via -backend.tlsCAFile command-line flag
across all the users.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5240
Previously entries which were accessed only 1 time weren't cached.
It has been appeared that some rarely executed heavy queries may read indexdb block twice
in a row instead of once. There is no need in caching such a block then.
This change should eliminate cache size spikes for indexdb/dataBlocks when such heavy queries are executed.
Expose -blockcache.missesBeforeCaching command-line flag, which can be used for fine-tuning
the number of cache misses needed before storing the block in the caching.
* app/vmalert: update remote-write process
* automatically retry remote-write requests on closed connections. The change should reduce the amount of logs produced in environments with short-living connections or environments without support of keep-alive on network balancers.
* increment `vmalert_remotewrite_errors_total` metric if all retries to send remote-write request failed. Before, this metric was incremented only if remote-write client's buffer is overloaded.
* increment `vmalert_remotewrite_dropped_rows_total` amd `vmalert_remotewrite_dropped_bytes_total` metrics if remote-write client's buffer is overloaded. Before, these metrics were incremented only after unsuccessful HTTP calls.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* Update docs/CHANGELOG.md
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: Hui Wang <haley@victoriametrics.com>
reduce the number of queries for restoring alerts state on start-up.
The change should speed up the restore process and reduce pressure on `remoteRead.url`.
* vmauth: add browser authorization request for http requests without credentials to a route that is not in the `unauthorized_user` section (when `unauthorized_user` is specified).
* add link to issue in CHANGELOG
* Extend vmauth docs
* wip
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
This reduction is based on production testing.
Also expose -search.minWindowForInstantRollupOptimization command-line flag, so users could fine-tune this arg for their needs
vmalert expects string value for stats.seriesFetched, so it is impossible
switching to number without breaking compatibility with old vmalert releases :(
It is still unclear why stats.seriesFetched has string type in the first place...
Repeated instant queries with long lookbehind windows, which contain one of the following rollup functions,
are optimized via partial result caching:
- sum_over_time()
- count_over_time()
- avg_over_time()
- increase()
- rate()
The basic idea of optimization is to calculate
rf(m[d] @ t)
as
rf(m[offset] @ t) + rf(m[d] @ (t-offset)) - rf(m[offset] @ (t-d))
where rf(m[d] @ (t-offset)) is cached query result, which was calculated previously
The offset may be in the range of up to 1 hour.
The SECURITY label should be applied only to changes, which fix security issues.
The change at ad839aa492 adds new command-line flags, which can be used
for improving security in some cases. They do not fix any security issues.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5111
The new metric gets increased each time `-search.logQueryMemoryUsage` memory limit
is exceeded by a query. This metric should help to identify expensive and heavy queries
without inspecting the logs.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
The new rule for vmalert supposed to detect groups that miss their
evaulations due to slow queries.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
The panel `Errors rate to Alertmanager` had `group` label filter
applied to the expression, while the metric `vmalert_alerts_send_errors_total`
doesn't have that label. This resulted into always empty results.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
fix possible missing firing states for alerting rules in replay mode
Before if one firing stage is bigger than single query request range, like rule with a big `for`, alerting rule won't able to be detected as firing.
Co-authored-by: hagen1778 <roman@victoriametrics.com>
support `Strict-Transport-Security`, `Content-Security-Policy` and `X-Frame-Options`
HTTP headers in all VictoriaMetrics components.
The values for headers can be specified by users via the following flags:
`-http.header.hsts`, `-http.header.csp` and `-http.header.frameOptions`.
Co-authored-by: hagen1778 <roman@victoriametrics.com>
This allows substituting FATAL panics with recoverable runtime errors such as missing or invalid TLS CA file
and/or missing/invalid /var/run/secrets/kubernetes.io/serviceaccount/namespace file.
Now these errors are logged instead of PANIC'ing, so they can be fixed by updating the corresponding files
without the need to restart vmagent.
This is a follow-up for 90427abc65
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/5243
Previously url watchers for pod, service and node objects could be mistakenly closed
when service discovery was set up only for endpoints and endpointslice roles,
since watchers for these roles may start start pod, service and node url watchers
with nil apiWatcher passed to groupWatcher.startWatchersForRole().
Now all the url watchers, which belong to a particular groupWatcher, are stopped at once
when this groupWatcher has no apiWatcher subscribers.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5216
The issue has been introduced in v1.93.5 when addressing https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4850
* do not print redundant error logs when failed to scrape consul or nomad target
prometheus performs the same because it uses consul lib which just drops the error(1806bcb38c/api/api.go (L1134))
- Make sure that invalid/missing TLS CA file or TLS client certificate files at vmagent startup
don't prevent from processing the corresponding scrape targets after the file becomes correct,
without the need to restart vmagent.
Previously scrape targets with invalid TLS CA file or TLS client certificate files
were permanently dropped after the first attempt to initialize them, and they didn't
appear until the next vmagent reload or the next change in other places of the loaded scrape configs.
- Make sure that TLS CA is properly re-loaded from file after it changes without the need to restart vmagent.
Previously the old TLS CA was used until vmagent restart.
- Properly handle errors during http request creation for the second attempt to send data to remote system
at vmagent and vmalert. Previously failed request creation could result in nil pointer dereferencing,
since the returned request is nil on error.
- Add more context to the logged error during AWS sigv4 request signing before sending the data to -remoteWrite.url at vmagent.
Previously it could miss details on the source of the request.
- Do not create a new HTTP client per second when generating OAuth2 token needed to put in Authorization header
of every http request issued by vmagent during service discovery or target scraping.
Re-use the HTTP client instead until the corresponding scrape config changes.
- Cache error at lib/promauth.Config.GetAuthHeader() in the same way as the auth header is cached,
e.g. the error is cached for a second now. This should reduce load on CPU and OAuth2 server
when auth header cannot be obtained because of temporary error.
- Share tls.Config.GetClientCertificate function among multiple scrape targets with the same tls_config.
Cache the loaded certificate and the error for one second. This should significantly reduce CPU load
when scraping big number of targets with the same tls_config.
- Allow loading TLS certificates from HTTP and HTTPs urls by specifying these urls at `tls_config->cert_file` and `tls_config->key_file`.
- Improve test coverage at lib/promauth
- Skip unreachable or invalid files specified at `scrape_config_files` during vmagent startup, since these files may become valid later.
Previously vmagent was exitting in this case.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4959
It could return either `failed to read` or `failed to parse` errors depending
on whether the given url can be loaded or not under the current environment
c9375cac5e
Descriptions were updated in attempt to make it more clear for readers,
re-phrasing and linking missing docs.
`eval_delay` was added to tests to verify it can be unmarshalled.
`eval_delay` is now applied before timestamp alignment to make it more predictable.
Before, if delay < interval the timestamp won't be aligned.
`eval_delay` and `eval_offset` was added to API output.
`PreviouslySentSeriesToRW` converted to private `previouslySentSeriesToRW`.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Using `min_over_time` should reduce the amount of false positives when
component is running in near-the-threshold state. Now it should trigger
only if all collected samples were above the threshold on 10m interval.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* app/vmselect: limit the number of parallel workers by 32
The change should improve performance and memory usage during query processing
on machines with big number of CPU cores. The number of parallel workers for
query processing is controlled via `-search.maxWorkersPerQuery` command-line flag.
By default, the number of workers is limited by the number of available CPU cores,
but not more than 32. The limit can be increased via `-search.maxWorkersPerQuery`.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* wip
- The `-search.maxWorkersPerQuery` command-line flag doesn't limit resource usage,
so move it from the `resource usage limits` to `troubleshooting` chapter at docs/Single-server-VictoriaMetrics.md
- Make more clear the description for the `-search.maxWorkersPerQuery` command-line flag
- Add the description of `-search.maxWorkersPerQuery` to docs/Cluster-VictoriaMetrics.md
- Limit the maximum value, which can be passed to `-search.maxWorkersPerQuery`, to GOMAXPROCS,
because bigger values may worsen query performance and increase CPU usage
- Improve the the description of the change at docs/CHANGELOG.md. Mark it as FEATURE instead of BUGFIX,
since it is closer to a feature than to a bugfix.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5087
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* fix inconsistent behaviors with prometheus when scraping
1. address https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4959. skip job with wrong syntax in `scrape_configs` with error logs instead of exiting;
2. show error messages on vmagent /targets ui if there are wrong auth configs in `scrape_configs`, previously will print error logs and do scrape without auth header;
3. don't send requests if there are wrong auth configs in:
1. vmagent remoteWrite;
2. vmalert datasource/remoteRead/remoteWrite/notifier.
* add changelogs
* address review comments
* fix ut
This can be useful in the following queries:
drop_empty_series(temperature <= 30) default 40
This query drops temperature series with all the values bigger than 30 on the selected time range,
while replacing gaps in the remaining series with 40.
The query without drop_empty_series:
(temperature <= 30) default 40
would leave all the temperature series with all the values bigger than 30 on the selected time range,
and replace all their values with 40. This is not what could be epxected in some cases
like here - https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5071
This reverts commit 3d7a77bf82.
Reason for revert: relative links do not work properly at GitHub code
and at GitHub wiki. For example, the following page contains broken links
before reverting this commit:
https://github.com/VictoriaMetrics/VictoriaMetrics/blob/master/docs/VictoriaLogs/CHANGELOG.md
It is always better to use absolute links thank relative links, since the page contents
can be copy-n-pasted to other pages, which are located in vastly different directories,
and all the links will remain working.
* fixed error when creating a full backup using the `-origin` flag (#5144)
* Update docs/CHANGELOG.md
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
This is a follow-up for f60c08a7bd
Changes:
- Make sure all the urls related to NewRelic protocol start from /newrelic . Previously some urls were started from /api/v1/newrelic
- Remove /api/v1 part from NewRelic urls, since it has no sense
- Remove automatic transformation from CamelCase to snake_case for NewRelic labels and metric names,
since it may complicate the transition from NewRelic to VictoriaMetrics. Preserve all the metric names and label names,
so users could query metrics and labels by the same names which are used in NewRelic.
The automatic transformation from CamelCase to snake_case can be added later as a special action for relabeling rules if needed.
- Properly update per-tenant data ingestion stats at app/vmagent/newrelic/request_handler.go . Previously it was always zero.
- Fix NewRelic urls in vmagent when multitenant data ingestion is enabled. Previously they were mistakenly started from `/`.
- Document NewRelic data ingestion url at docs/Cluster-VictoriaMetrics.md
- Remove superflouos memory allocations at lib/protoparser/newrelic
- Improve tests at lib/protoparser/newrelic/*
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3520
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/4712
- Make more clear the docs at docs/enterprise.md, so readers could figure out faster
on how to obtain enterprise key and how to pass it to VictoriaMetrics Enterprise components.
- Fix examples at docs/enterprise.md, which were referring to non-existing `-license-file` command-line flag.
The `-licenseFile` command-line flag must be used instead.
- Improve the description of `-license*` command-line flags, so users could understand
faster how to use them.
- Improve the warning message, which is emitted when the deprecated -eula command-line flag is passed,
so the user could figure out how to switch faster to -license* command-line flags.
- Disallow running VictoriaMetrics components with both -license and -licenseFile command-line flags.
- Disallow running VictoriaMetrics components when -licensFile points to an empty file.
- Consistently use the phrase "This flag is available only in Enterprise binaries" across
all the enterprise-specific command-line flags.
- Remove unneeded level of indirection for `noLicenseMessage` and `expiredMessage` string contants
in order to improve code readability and maintainability.
- Remove unneded `return` statements after `logger.Fatalf()` calls, since these calls exit the app and never return.
- Make sure that the info log message about successful license verification is emitted
when the license is verified successfully. Previously the error message could be logged
when the license payload is invalid or if it misses some required features.
This reverts commit a8345bb1b9
Reason for revert: VictoriaMetrics binaries are consistently created inside `bin` directory at the root of the repository
when running `make <vm-app>` according to https://docs.victoriametrics.com/#how-to-build-from-sources
If some dev environments create binaries inside random directories, then it is better to provide docs
at https://docs.victoriametrics.com/#how-to-build-from-sources on how to setup these IDEs, so they
consistently create binaries at bin/* directory at the root of the repository instead of trying to add
random ignore rules inside .gitignore.
As for the data directories created by VictoriaMetrics components, they may be created at random places too,
so there is little sense in trying to add ignore rules for all these directories inside .gitignore.
It is better to document that the built binaries must be consistently started from the repository root,
so data directories are created at the repository root. The .gitignore already contains rule
for blocking common data directories, which can be created by VictoriaMetrics components at the repository root.
- Reduce vertical space usage, so more information is available on the screen without the need to scroll.
- Show information for lines with higher values at the top of the legend under the graph.
This should simplify graph analysis when it contains many lines.
Fix vminsert/vmstorage/vmselect metrics filtering when dashboard is used
to display data from many sub-clusters with unique job names.
Before, only one specific job could have been accounted for component-specific panels,
instead of all available jobs for the component.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
reduce lock contention for heavy aggregation requests
previously lock contetion may happen on machine with big number of CPU due to enabled string interning. sync.Map was a choke point for all aggregation requests.
Now instead of interning, new string is created. It may increase CPU and memory usage for some cases.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5087
* vmalert: add `query_time_alignment` for rule group
1. add `eval_alignment` attribute for group which by default is true. So group rule query stamp will be aligned with interval and propagated to ALERT metrics and the messages for alertmanager;
2. deprecate `datasource.queryTimeAlignment` flag.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5049
Strip sensitive information such as auth headers or passwords from datasource, remote-read,
remote-write or notifier URLs in log messages or UI. This behavior is by default and is controlled via
`-datasource.showURL`, `-remoteRead.showURL`, `remoteWrite.showURL` or `-notifier.showURL` cmd-line flags.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5044
* lib/promscrape: make concurrency control optional
Before, `-maxConcurrentInserts` was limiting all calls to `promscrape.Parse`
function: during ingestion and scraping. This behavior is incorrect.
Cmd-line flag `-maxConcurrentInserts` should have effect onl on ingestion.
Since both pipelines use the same `promscrape.Parse` function, we extend it
to make concurrency limiter optional. So caller can decide whether concurrency
should be limited or not.
This commit makes c53b5788b4
obsolete.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* Revert "dashboards: move `Concurrent inserts` panel to Troubleshooting section"
This reverts commit c53b5788b4.
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
- Move uniqueFields from rows to blockStreamMerger struct.
This allows localizing all the references to uniqueFields inside blockStreamMerger.mustWriteBlock(),
which should improve readability and maintainability of the code.
- Remove logging of the event when blocks cannot be merged because they contain more than maxColumnsPerBlock,
since the provided logging didn't provide the solution for the issue with too many columns.
I couldn't figure out the proper solution, which could be helpful for end user,
so decided to remove the logging until we find the solution.
This commit also contains the following additional changes:
- It truncates field names longer than 128 chars during logs ingestion.
This should prevent from ingesting bogus field names.
This also should prevent from too big columnsHeader blocks,
which could negatively affect search query performance,
since columnsHeader is read on every scan of the corresponding data block.
- It limits the maximum length of const column value to 256.
Longer values are stored in an ordinary columns.
This helps limiting the size of columnsHeader blocks
and improving search query performance by avoiding
reading too long const columns on every scan of the corresponding data block.
- It deduplicates columns with identical names during data ingestion
and background merging. Previously it was possible to pass columns with duplicate names
to block.mustInitFromRows(), and they were stored as is in the block.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4762
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/4969
docs: add clarification of the retention filter usage
Updated documentation regarding retention filter usage if duration is set lower than
`-retentionPeriod` flag value.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: hagen1778 <roman@victoriametrics.com>
* lib/promscrape: add metric `vm_promscrape_scrapes_skipped_total`
add metric `vm_promscrape_scrapes_skipped_total`to show whether vmagent skips the scrapes.
This could happen if vmagent is overloaded or target is responding too slow for configured `scrape_interval`.
The follow-up commit should add a corresponding alerting rule and panel to vmagent dashboard.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* deployment/docker: add `TooManyScrapeSkips` alerting rule for vmagent
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* dashboards: add panels `Scrape duration 0.99 quantile` and `Skipped scrapes` to vmagent dashboard
Signed-off-by: hagen1778 <roman@victoriametrics.com>
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
- Compare the actual free disk space to the value provided via -storage.minFreeDiskSpaceBytes
directly inside the Storage.IsReadOnly(). This should work fast in most cases.
This simplifies the logic at lib/storage.
- Do not take into account -storage.minFreeDiskSpaceBytes during background merges, since
it results in uncontrolled growth of small parts when the free disk space approaches -storage.minFreeDiskSpaceBytes.
The background merge logic uses another mechanism for determining whether there is enough
disk space for the merge - it reserves the needed disk space before the merge
and releases it after the merge. This prevents from out of disk space errors during background merge.
- Properly handle corner cases for flushing in-memory data to disk when the storage
enters read-only mode. This is better than losing the in-memory data.
- Return back Storage.MustAddRows() instead of Storage.AddRows(),
since the only case when AddRows() can return error is when the storage is in read-only mode.
This case must be handled by the caller by calling Storage.IsReadOnly()
before adding rows to the storage.
This simplifies the code a bit, since the caller of Storage.MustAddRows() shouldn't handle
errors returned by Storage.AddRows().
- Properly store parsed logs to Storage if parts of the request contain invalid log lines.
Previously the parsed logs could be lost in this case.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4737
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/4945
This should reduce tail latency during data ingestion.
This shouldn't slow down data ingestion in the worst case, since assisted merges are spread among
distinct addRows/addItems calls after this change.
It was added in order to limit number of goroutines performing assisted merges during ingestion.
It turned out that blocking ingestion goroutines lower ingestion performance and limits overall ingestion around 40k items per seconds because of lock contention.
Removing parts merge sync.Cond allows to remove lock contention at write path and significantly improves write performance.
See: https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4775
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* vmui: update information about tsdb usage in cluster version
* vmui: cleanup
* vmui: add CHANGELOG.md
* vmui: cleanup
* vmui: update logic, move information to the visible place
* app/vmui: remove values fetch, update documentation for cardinality explorer
* app/vmui: update CHANGELOG.md
Moved because this panel is related to both: scraped and ingested data.
Before, it could have give a misleading impression that it is related to ingested metrics only.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* docker-compose: add vmauth to cluster env
vmauth acts as a balancer and used as an example of how to interconnect
VM components via vmauth.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* docker-compose: add vmauth to cluster env
vmauth acts as a balancer and used as an example of how to interconnect
VM components via vmauth.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: Nikolay <nik@victoriametrics.com>
lib/promscrape/discovery/kubernetes: supress context.Cancelled error in logs
It is possible that context.Cancelled will appear after k8s watcher was closed due to reload(see https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4850).
Logging an error misinforms user and looks like vmagent discovery will stop working even though this does not affect discovery.
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* expose metrics `vmauth_config_last_reload_*` for tracking the state of config reloads, similarly to vmagent/vmalert components.
* do not print logs like `SIGHUP received...` once per configured `-configCheckInterval` cmd-line flag. This log will be printed only if config reload was invoked manually.
* prevent configuration reloading if there were no changes in config. This improves memory usage when `-configCheckInterval` cmd-line flag is configured and config has extensive list of regexp expressions requiring additional memory on parsing.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* lib/storage/partition: add check to ensure parts exist on disk
If part exists in parts.json but is missing on disk there will be a misleading error similar to "unexpected number of substrings in the part name".
This change forces verification of part existence and throws a correct error in case it is missing on disk.
Such issue can be result of https://github.com/VictoriaMetrics/VictoriaMetrics/issues/5005 or disk corruption.
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* lib/storage/partition: use filepath.Join instead of string concatenation
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* lib/storage/partition: add action points for error message
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* all: add a check for missing part in lib/mergeset and lib/logstorage
---------
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
- Switch from summary to histogram for vl_http_request_duration_seconds metric.
This allows calculating request duration quantiles across multiple hosts
via histogram_quantile(0.99, sum(vl_http_request_duration_seconds_bucket) by (vmrange)).
- Take into account only successfully processed data ingestion requests
when updating vl_http_request_duration_seconds histogram.
Failed requests are ignored, since they may significantly skew measurements.
- Clarify the description of the change at docs/VictoriaLogs/CHANGELOG.md.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/4934
- Move the bugfix description to the correct place in docs/CHANGELOG.md
- Prevent from logging of 'context canceled' errors after the url watcher is stopped,
since these errors are expected and may confuse users.
- Remove unused urlWatcher.refCount field.
- Remove unused urlWatcher.close() method.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4850
* lib/promscrape/discovery/kubernetes: fix leaking api watcher
goroutine which was polling k8s API had no execution control. This leaded to leaking goroutines during config reload.
See: https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4850
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* lib/promscrape/discovery/kubernetes: use reference counting for urlWatcher cleanup
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* lib/promscrape/discovery/kubernetes: remove waitgroup sync for goroutines polling API server
This is unnecessary since context will is cancelled and new requests will not be sent. Also, using waitgroup will increase time required to perform reload which might result in missed scrapes.
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* lib/promscrape/discovery/kubernetes: clarify comment
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* Apply suggestions from code review
* lib/promscrape/discovery/kubernetes: address review feedback
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
---------
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
Co-authored-by: Nikolay <nik@victoriametrics.com>
Adding limit on ingestion allows to avoid issues like this one https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4762
Such issues are often caused by misconfigurtion on log persing/ingestion side and preventing such rows from being ingested allows to avoid performance implications created by storing such log rows.
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
This automation doesn't work as intended on LTS releases, bugfix releases and custom releases,
since it assumes every new tag is related only to new release.
Also the github.com/VictoriaMetrics/ops repository may contain manually set custom tags
for VictoriaMetrics components (for example, for testing the latest bugfixes or features),
which are overwritten by the generated pull request.
The way to go is to manually update tags at github.com/VictoriaMetrics/ops repository when needed
instead of trying to automate this process.
This should prevent from emitting too long lines when too long args are passed to logger.* functions.
For example, too long MetricsQL queries or too long data samples.
* Introduce flagutil.Duration
To avoid conversion bugs
* Fix tests
* Clarify documentation re. month=31 days
* Add fasttime.UnixTime() to obtain time.Time
The goal is to refactor out the last usage of `.Msecs`.
* Use fasttime for time.Now()
* wip
- Remove fasttime.UnixTime(), since it doesn't improve code readability and maintainability
- Run `make docs-sync` for syncing changes from README.md to docs/ folder
- Make lib/flagutil.Duration.Msec private
- Rename msecsPerMonth const to msecsPer31Days in order to be consistent with retention31Days
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* feat: add cardinality support for prometheus (#4320)
* docs/CHANGELOG.md: add cardinality support for prometheus
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* deployment/docker: add VictoriaLogs configuration
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* deployment/docker/victorialogs: remove outdated comment
It was added in order to indicate that it is required to build VictoriaLogs manually before starting it at the time there was no public release available.
Currently, there is a public tag and it is not required to build it from sources.
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* deployment/docker/victorialogs/fluentbit: include log path in stream configuration
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* deployment/docker: add reference to monitoring setup for VictoriaLogs
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
---------
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* added ability to set and clear response headers (#4825)
Signed-off-by: Alexander Marshalov <_@marshalov.org>
* added ability to set and clear response headers (#4825)
Signed-off-by: Alexander Marshalov <_@marshalov.org>
* fix review comment
Signed-off-by: Alexander Marshalov <_@marshalov.org>
---------
Signed-off-by: Alexander Marshalov <_@marshalov.org>
* app/vminsert: fixes readonly check
previously vminsert doesn't check readOnly state for vmstorage, since check was never performed for nil buffer
In this case every 30 second storage node loss readonly state and received some data.
It caused re-routing and possible slow down for ingestion
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4870
* wip
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
Disallow parsing multitenant token at auth.NewToken().
Use auth.NewTokenPossibleMultitenant() at vminsert only. All the other callers should call auth.NewToken(),
since they do not support multitenant token.
This is a follow-up for f0c06b428e
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4910
- Document the change at docs/CHANGELOG.md
- Set the default value for -vmstorageUserTimeout to 3 seconds. This is much better
than the 0 value, which means that TCP connection to unreachable vmstorage could block
for up to 16 minutes.
- Document -vmstorageUserTimeout at docs/Cluster-VictoriaMetrics.md
The previous code could result in the following data race:
1. The s.ptwHot partition is marked to be deleted
2. ptw.decRef() is called on it
3. ptw.pt is set to nil
4. s.ptwHot.pt is accessed from concurrent goroutine, which leads to panic.
The change clears s.ptwHot under s.partitionsLock in order to prevent from the data race.
This is a follow-up for 8d50032dd6
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4895
`-promscrape.cluster.membersCount` by default should be `1`, like every
single vmagent is a cluster of one member on its own.
The change additionally validates that user can't set `-promscrape.cluster.membersCount`
to value lower than `1`.
eabcfc9bcd
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Initially, stream parse mode was reading data from response and parsing it on flight. This was causing longer delay to read the whole response and required increasing timeout value to allow data processing while reading. So that 908e35affd increased timeout value to fix this.
But after 74c00a8762 response in stream parse mode is saved into memory and then parsed eliminating necessity of having timeout value higher that for usual scrape.
Updates: https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4847
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* vmagent: retry failed write request on the closed connection
Retry failed write request on the closed connection immediately,
without waiting for backoff. This should improve data delivery speed
and reduce amount of error logs emitted by vmagent when using idle connections.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4139
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmagent: retry failed write request on the closed connection
Re-instantinate request before retry as body could have been already spoiled.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: Nikolay <nik@victoriametrics.com>
* vmalert: correctly re-instantinate HTTP req on retries
Previosly, request retry to datasource re-used existing HTTP request.
But if request object was already partially processed (body was read),
then retry will be unsuccessful.
The change re-instantinates HTTP request object before retry.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmalert: review fix
Signed-off-by: hagen1778 <roman@victoriametrics.com>
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* Add button to prettify query
Just capitalizes query text for now
* Add /prettify-query API handler
* Replace UI pretiffier using prettifier API
* Add showing server errors
Had to pass setQueryErrors from useFetchQuery.ts
* Use serverUrl from global AppState
* Change icon to AutoAwsome icon + added style change color when button is active
* Add sync/await to prettifyQuery function
* Doc public function for lint
* Minor async fix
* Removed extra blank lines
* Extract usePrettifyQuery hook
* Made more generic style for :active button
* Refactor usePrettifyQuery
However, prettify errors don't clean up query errors, but should
* Add prettyQuery functionality to CHANGELOG.md
* Reuse queryErrors
* Unhide errors on start
---------
Co-authored-by: Tamara <toma.vashchuk@gmail.com>
exclude assets/README.md from publishing on the docs website
as its purpose is different to other docs.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
- Fix Prometheus-compatible naming after applying the relabeling if -usePromCompatibleNaming command-line flag is set.
This should prevent from possible Prometheus-incompatible metric names and label names generated by the relabeling.
- Do not return anything from relabelCtx.appendExtraLabels() function, since it cannot change the number of time series
passed to it. Append labels for the passed time series in-place.
- Remove promrelabel.FinalizeLabels() call after adding extra labels to time series, since this call has been already
made at relabelCtx.applyRelabeling(). It is user's responsibility if he passes labels with double underscore prefixes
to -remoteWrite.label.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4247
This reverts commit 251d3d54a4.
Reason for revert: this commit removes important information, which was placed in the docs/assets folder
in order to reduce the probability of putting images related to particular docs here,
with the reasoning why this is a bad practice.
This information should remain in the docs/assets folder. Probably, the file should be renamed
from README.md to README or to some other visible name, which doesn't lead to generation of unneded
documentation for the assets folder.
app/vmctl: don't interrupt migration process if tenant has no data
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: Alexander Marshalov <_@marshalov.org>
VictoriaLogs has its own release schedule, so it must be released separately via:
make publish-victoria-logs release-victoria-logs
TODO: sync VictoriaLogs and VictoriaMetrics releases after VictoriaLogs goes out of preview stage.
This will simplify release process and upgrades at user side.
vmagent: properly add extra labels before sending data to remote storage
labels from `remoteWrite.label` are now added to sent metrics just before they
are pushed to `remoteWrite.url` after all relabelings, including stream aggregation relabelings (#4247)
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4247
Signed-off-by: Alexander Marshalov <_@marshalov.org>
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
Fix display of ingested rows rate for `Samples ingested/s`
and `Samples rate` panels for vmagent's dasbhoard.
Previously, not all ingested protocols were accounted in these panels.
An extra panel `Rows rate` was added to `Ingestion` section to display the split
for rows ingested rate by protocol.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
This reverts commit 252643d100.
Reason for revert: the commit incorrectly fixes the the issue.
The `remoteAddr` must be properly quoted inside lib/httpserver.GetQuotedRemoteAddr().
It isn't quoted properly if the request contains X-Forwarded-For header.
The proper fix will be included in the follow-up commit.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/4676
{vmagent/remotewrite,vminsert/common}: fix dropInput and keepInput flags inconsistency
Sync behavior for dropInput and keepInput flags between single-node and vmagent.
Fix vmagent not respecting dropInput flag and reverse logic for keepInput.
The deferred call's arguments are evaluated immediately, but the function call is not executed until the surrounding function returns.
Signed-off-by: Abirdcfly <fp544037857@gmail.com>
* rename `configErr` to `lastConfigErr` to reduce confusion
* add tests to verify metrics and msg are set properly
* fix mistake when config success metric wasn't restored after an error
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Value of `-dedup.minScrapeInterval` comand-line flag must be higher than `evaluation_interval` in order to make sure that only one sample on each evaluation will be left after deduplication.
Moreover, value of `-dedup.minScrapeInterval` must be a multiple of vmalert's `evaluation_interval` in order to make sure that samples will be aligned between deduplication window periods.
See: https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4774#issuecomment-1663940811
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
Correctly calculate `Bytes per point` value for single-server and cluster VM dashboards.
Before, the calculation mistakenly accounted for the number of entries in indexdb in
denominator, which could have shown lower values than expected.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
The `ConcurrentFlushesHitTheLimit` could be related to components like
vminsert, vmstorage, vm-single-node and vmagent. Moving this alert
to the `health` section of alerts will be benefitial for all components
and will remove the duplicates from single/cluster alerts.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
The new panel supposed to show whether the number of concurrent
inserts processed by vmagent isn't reaching the limit.
The panel contains recommendation what to do if limit is reached.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
vmauth: allow configuring deadline for a backend to be excluded from the rotation
The new flag `-failTimeout` allows overriding default time for a bad backend
to be excluded from rotation. The override option could be useful for systems
where it is expected for backends to be off for significant periods of time.
Co-authored-by: Zakhar Bessarab <zekker6@gmail.com>
Binary export API protocol can be disabled via `-vm-native-disable-binary-protocol` cmd-line flag when migrating data from VictoriaMetrics. Disabling binary protocol
can be useful for deduplication of the exported data before ingestion.
For this, deduplication need to be configured at `-vm-native-src-addr` side
and `-vm-native-disable-binary-protocol` should be set on vmctl side.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* lib/protoparser: adds opentelemetry parser
app/{vmagent,vminsert}: adds opentelemetry ingestion path
Adds ability to ingest data with opentelemetry protocol
protobuf and json encoding is supported
data converted into prometheus protobuf timeseries
each data type has own converter and it may produce multiple timeseries
from single datapoint (for summary and histogram).
only cumulative aggregationFamily is supported for sum(prometheus
counter) and histogram.
Apply suggestions from code review
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
updates deps
fixes tests
wip
wip
wip
wip
lib/protoparser/opentelemetry: moves to vtprotobuf generator
go mod vendor
lib/protoparse/opentelemetry: reduce memory allocations
* wip
- Remove support for JSON parsing, since it is too fragile and is rarely used in practice.
The most clients send OpenTelemetry metrics in protobuf.
The JSON parser can be added in the future if needed.
- Remove unused code from lib/protoparser/opentelemetry/pb and lib/protoparser/opentelemetry/proto
- Do not re-use protobuf message between ParseStream() calls, since there is high chance
of high fragmentation of the re-used message because of too complex nested structure of the message.
* wip
* wip
* wip
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
The important change is to highlight that restore procedure happens
only once and only for already loaded rules. Config hot-reload
doesn't trigger the restore procedure.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Add -remoteWrite.shardByURL command-line flag, which instructs vmagent to spread evenly
outgoing time series data among the configured remote storage systems specified via -remoteWrite.url .
Samples for the same time series go to the same -remoteWrite.url . This allows building horizontally
scalable stream aggregation when samples for counter and histogram series must be aggregated
by the same second-level vmagent instance.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4637
- Use a byte slice instead of a map for tracking indexes for matching series.
This improves performance, since access by slice index is faster than access by map key.
- Re-use the byte slice for tracking indexes for matching series.
This removes unnecessary memory allocations and improves stream aggregation performance a bit.
- Add an ability to return to the previous behvaiour by specifying -remoteWrite.streamAggr.dropInput command-line flag.
In this case all the input samples are dropped when stream aggregation is enabled.
- Backport the new stream aggregation behaviour from vmagent to single-node VictoriaMetrics when -streamAggr.config
option is set.
- Improve docs regarding this change at docs/CHANGELOG.md
- Document the new behavior at docs/stream-aggregation.md
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4243
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/4575
Previously all the incoming samples were de-duplicated, even if their series doesn't
match aggregation rule filters. This could result in increased CPU usage.
Now the de-duplication isn't applied to samples for series, which do not match
aggregation rule filters. Such samples are just ignored.
* lib/storage: pre-create timeseries before indexDB rotation
during an hour before indexDB rotation start creating records at the next indexDB
it must improve performance during switch for the next indexDB and remove ingestion issues.
Since there is no need for creation new index records for timeseries already ingested into current indexDB
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4563
* lib/storage: further work on indexdb rotation optimization
- Document the change at docs/CHAGNELOG.md
- Move back various caches from indexDB to Storage. This makes the change less intrusive.
The dateMetricIDCache now takes into account indexDB generation, so it stores (date, metricID)
entries for both the current and the next indexDB.
- Consolidate the code responsible for idbNext pre-filling into prefillNextIndexDB() function.
This improves code readability and maintainability a bit.
- Rewrite and simplify the code responsible for calculating the next retention timestamp.
Add various tests for corner cases of this code.
- Remove indexdb pre-filling from RegisterMetricNames() function, since this function is rarely called.
It is OK to add indexdb entries on demand in this function. This simplifies the code.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1401
* docs/CHANGELOG.md: refer to https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4563
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* app/vmalert/datasource/graphite: allow overriding "from" parameter for datasource queries
Fixes construction of URL parameters for graphite render to allow overriding "from" parameter.
See: https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4685
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* app/vmalert/datasource/graphite: update flow for building URL parameters
Makes flow of building URL parameters same as Prometheus datasource has:
1) Setting all default values
2) Merging those values with provided `extraParams`
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* Update docs/CHANGELOG.md
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
---------
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
The change disables version updates for repo packages.
Please note, security updates should not be affected by the change
according to https://docs.github.com/en/code-security/dependabot/dependabot-version-updates/configuration-options-for-the-dependabot.yml-file#open-pull-requests-limit:
```
open-pull-requests-limit
By default, Dependabot opens a maximum of five pull requests for version updates. Once there are five open pull requests from Dependabot, Dependabot will not open any new requests until some of those open requests are merged or closed.
This option has no impact on security updates, which have a separate, internal limit of ten open pull requests.
```
Signed-off-by: hagen1778 <roman@victoriametrics.com>
- Round staleness_interval durations to the upper number of seconds.
This should prevent from under-calculations for fractional staleness intervals.
- Rename stalenessInterval field at *AggrState structs into stalenessSecs, since it holds seconds.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4667
- Improve docs
- Hide `debug relabeling` column when -promscrape.dropOriginalLabels command-line flag is set
- Inline the code from the added template functions, since the code is harder to follow
with the template functions, especially when these functions have misleading names.
Also, these functions are used only in one place, e.g. they do not reduce the amounts of code.
- Hide `click to show original labels` title at `labels` column when original labels aren't available.
- Show the reason on whey original labels aren't available at /service-discovery page.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4597
- Parse protobuf if Content-Type isn't set to `application/json` - this behavior is documented at https://grafana.com/docs/loki/latest/api/#push-log-entries-to-loki
- Properly handle gzip'ped JSON requests. The `gzip` header must be read from `Content-Encoding` instead of `Content-Type` header
- Properly flush all the parsed logs with the explicit call to vlstorage.MustAddRows() at the end of query handler
- Check JSON field types more strictly.
- Allow parsing Loki timestamp as floating-point number. Such a timestamp can be generated by some clients,
which store timestamps in float64 instead of int64.
- Optimize parsing of Loki labels in Prometheus text exposition format.
- Simplify tests.
- Remove lib/slicesutil, since there are no more users for it.
- Update docs with missing info and fix various typos. For example, it should be enough to have `instance` and `job` labels
as stream fields in most Loki setups.
- Allow empty of missing timestamps in the ingested logs.
The current timestamp at VictoriaLogs side is then used for the ingested logs.
This simplifies debugging and testing of the provided HTTP-based data ingestion APIs.
The remaining MAJOR issue, which needs to be addressed: victoria-logs binary size increased from 13MB to 22MB
after adding support for Loki data ingestion protocol at https://github.com/VictoriaMetrics/VictoriaMetrics/pull/4482 .
This is because of shitty protobuf dependencies. They must be replaced with another protobuf implementation
similar to the one used at lib/prompb or lib/prompbmarshal .
Loki uses default labels format without "or" operator. This format can't create a list of LabelFilters, so only first set of LabelFilters should be used.
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* app/vmagent: fix creating target id if `--promscrape.dropOriginalLabels` flag was used
* app/vmagent: hide links if OriginalLabels was dropped
* app/vmagent: update CHANGELOG.md and added information to the docs
* app/vmagent: fix comments
* app/vlinsert: add support of loki push protocol
- implemented loki push protocol for both Protobuf and JSON formats
- added examples in documentation
- added example docker-compose
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* app/vlinsert: move protobuf metric into its own file
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* deployment/docker/victorialogs/promtail: update reference to docker image
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* deployment/docker/victorialogs/promtail: make volume name unique
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* app/vlinsert/loki: add license reference
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* deployment/docker/victorialogs/promtail: fix volume name
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* docs/VictoriaLogs/data-ingestion: add stream fields for loki JSON ingestion example
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* app/vlinsert/loki: move entities to places where those are used
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* app/vlinsert/loki: refactor to use common components
- use CommonParameters from insertutils
- stop ingestion after first error similar to elasticsearch and jsonline
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* app/vlinsert/loki: address review feedback
- add missing logstorage.PutLogRows calls
- refactor tenant ID parsing to use common function
- reduce number of allocations for parsing by reusing logfields slices
- add tests and benchmarks for requests processing funcs
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
---------
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
This eliminates the need in .(*T) casting for results obtained from Load()
Leave atomic.Value for map, since atomic.Pointer[map[...]...] makes double pointer to map,
because map is already a pointer type.
- Add `Active queries` chapter to VMUI docs
- Set `Content-Type: json` header inside promql.WriteActiveQueries() handler,
in order to be consistent with other request handlers called at app/vmselect/main.go
- Pass the request to promql.WriteActiveQueries() handler, so it can change its output
depending on the provided request params. This also improves consistency of
promql.WriteActiveQueries() args with other request hanlers at app/vmselect/main.go
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/4653
* feat: add page to display a list of active queries (#4598)
* app/vmagent: code formatting
* fix: remove console
---------
Co-authored-by: dmitryk-dk <kozlovdmitriyy@gmail.com>
Using `job=~$job_storage` forces "Cache usage" panel to display only vmstorage caches, but there is a cache peresent at vmselect(`promql/rollupResult`).
Updated selector to match generic `$job` so that all caches will be displayed with an option to display per-job caches.
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
The following syntax is supported: _time:filter offset off
For example:
- _time:5m offset 1h - 5-minute duration one hour before the current time
- _time:2023 offset 2w - 2023 year with the 2 weeks offset in the past
The `a op b keep_metric_names` is ambigouos to `a op (b keep_metric_names)` when `b` is a transform or rollup function.
For example, `a + rate(b) keep_metric_names`. So it is better to use more clear syntax: `(a op b) keep_metric_names`
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3710
- Sort MetricName tags only once before the benchmark loop.
- Obtain indexSearch per each benchmark loop in order to give a chance for background merge
for the recently created parts
Previously all the newly ingested time series were registered in global `MetricName -> TSID` index.
This index was used during data ingestion for locating the TSID (internal series id)
for the given canonical metric name (the canonical metric name consists of metric name plus all its labels sorted by label names).
The `MetricName -> TSID` index is stored on disk in order to make sure that the data
isn't lost on VictoriaMetrics restart or unclean shutdown.
The lookup in this index is relatively slow, since VictoriaMetrics needs to read the corresponding
data block from disk, unpack it, put the unpacked block into `indexdb/dataBlocks` cache,
and then search for the given `MetricName -> TSID` entry there. So VictoriaMetrics
uses in-memory cache for speeding up the lookup for active time series.
This cache is named `storage/tsid`. If this cache capacity is enough for all the currently ingested
active time series, then VictoriaMetrics works fast, since it doesn't need to read the data from disk.
VictoriaMetrics starts reading data from `MetricName -> TSID` on-disk index in the following cases:
- If `storage/tsid` cache capacity isn't enough for active time series.
Then just increase available memory for VictoriaMetrics or reduce the number of active time series
ingested into VictoriaMetrics.
- If new time series is ingested into VictoriaMetrics. In this case it cannot find
the needed entry in the `storage/tsid` cache, so it needs to consult on-disk `MetricName -> TSID` index,
since it doesn't know that the index has no the corresponding entry too.
This is a typical event under high churn rate, when old time series are constantly substituted
with new time series.
Reading the data from `MetricName -> TSID` index is slow, so inserts, which lead to reading this index,
are counted as slow inserts, and they can be monitored via `vm_slow_row_inserts_total` metric exposed by VictoriaMetrics.
Prior to this commit the `MetricName -> TSID` index was global, e.g. it contained entries sorted by `MetricName`
for all the time series ever ingested into VictoriaMetrics during the configured -retentionPeriod.
This index can become very large under high churn rate and long retention. VictoriaMetrics
caches data from this index in `indexdb/dataBlocks` in-memory cache for speeding up index lookups.
The `indexdb/dataBlocks` cache may occupy significant share of available memory for storing
recently accessed blocks at `MetricName -> TSID` index when searching for newly ingested time series.
This commit switches from global `MetricName -> TSID` index to per-day index. This allows significantly
reducing the amounts of data, which needs to be cached in `indexdb/dataBlocks`, since now VictoriaMetrics
consults only the index for the current day when new time series is ingested into it.
The downside of this change is increased indexdb size on disk for workloads without high churn rate,
e.g. with static time series, which do no change over time, since now VictoriaMetrics needs to store
identical `MetricName -> TSID` entries for static time series for every day.
This change removes an optimization for reducing CPU and disk IO spikes at indexdb rotation,
since it didn't work correctly - see https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1401 .
At the same time the change fixes the issue, which could result in lost access to time series,
which stop receving new samples during the first hour after indexdb rotation - see https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2698
The issue with the increased CPU and disk IO usage during indexdb rotation will be addressed
in a separate commit according to https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1401#issuecomment-1553488685
This is a follow-up for 1f28b46ae9
The number of parts in the snapshot partition may be zero if concurrent goroutine just
started creating new partition, but didn't put data into it yet when the current
goroutine made a snapshot.
This reverts commit 20b18e9feb.
Reason for revert: running goimports on `make check-all` introduces the following issues:
- It runs only on modified files, which weren't commited yet into git repository.
This means the formatting for the remaining files becomes different comparing to the formatting
for the changed files. This also means that the goimports has no any effect
at github actions and when the changed code is already commited to git repository.
- `gomiports` performs formatting in the same way as gofmt, so `make fmt` becomes unnecessary.
But when `gofmt` is substituted with `goimports`, then it performs unnecessary formatting for *.qtpl.go files.
It is possible to make a hack, which will prepare a list of all the *.go files at lib/ and app/
without the *.qtpl.go files, and then feed this list to `goimports`, but this looks too fragile
for the task of just fixing the ordering of Go imports.
So it is better to leave source code formatting as is with `gofmt`, while manually fixing improper ordering
of Go import from time to time in dedicated commits until better solution arises.
Add a break if gotAlert is nil
This removes the following golangci-lint warning:
app/vmalert/alerting_test.go:868:8: SA5011(related information): this check suggests that the pointer can be nil (staticcheck)
if gotAlert == nil {
^
* app/vmctl: fix panic `--remote-read-filter-time-start` flag not defined
* app/vmctl: update CHANGELOG.md
---------
Co-authored-by: Nikolay <nik@victoriametrics.com>
It could happen for low evaluation intervals and irregular
delays during execution that evaluation time would get
a negative offset. This could result into cumulative
discrepancy between the actual time and evaluation time for rules.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* make: add goimports task
Adds task to fix imports formatting implace.
Formats imports into:
- native library
- external libraries
- local packages based on github.com/VictoriaMetrics/VictoriaMetrics prefix
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* make: add goimports install task
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* make: run goimports only for changed files
Applying goimports to all existing files would create a lot of problems with cherry-picking changes between different branches used for development. To avoid this it was decided to only run goimports on changed files to fix formatting gradually.
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* make: update goimports to run on all changed files
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
---------
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
- Clarify docs about -replicationFactor command-line flag at vmselect
- Clarify description for -replicationFactor and -search.skipSlowReplicas command-line flags
- Fix the logic for returning responses if -search.skipSlowReplicas command-line flag
is enabled. The logic was broken in the 173ccf4333,
so it could return responses only if some of vmstorage nodes return error,
while it should return when query results are successfully collected from more than
(len(storageNodes) - replicationFactor) vmstorage nodes.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1207
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/711
* vmselect: introduce `search.skipSlowReplicas` cmd-line flag
vmselect has two logical conditions during request processing when
`-replicationFactor` cmd-line flag is set:
1. If at least `len(storageNodes) - replicationFactor` responded, it could skip
waiting for the rest of nodes to respond. This could lead to problems described
here https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1207.
2. Mark response as partial if less than `len(storageNodes) - replicationFactor` responded
without an error.
The P1 showed itself error-prone and became the main reason why
`-replicationFactor` wasn't recommended to use at vmselect level.
However, this optimization could be still very useful in situations
when there are slow and fast replicas in cluster.
But P2 remains viable and important conditionless.
Hiding P1 behind the feature-flag `search.skipSlowReplicas`
should make `-replicationFactor` flag usable again. And let users
choose whether they want P1 to be respected.
Related issues
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1207https://github.com/VictoriaMetrics/VictoriaMetrics/issues/711
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* docs: update changelog
Signed-off-by: hagen1778 <roman@victoriametrics.com>
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* fix removing storage data dir before restoring from backup
Signed-off-by: Alexander Marshalov <_@marshalov.org>
* fix review comment
Signed-off-by: Alexander Marshalov <_@marshalov.org>
* fix review comment
Signed-off-by: Alexander Marshalov <_@marshalov.org>
* fixes after merge with `enterprise-single-node` branch
Signed-off-by: Alexander Marshalov <_@marshalov.org>
---------
Signed-off-by: Alexander Marshalov <_@marshalov.org>
- Clarify the scope of the fix at docs/CHANGELOG.md
- Handle the case when -search.maxSamplesPerSeries limit is exceeded
in the same way as the -search.maxSamplesPerQuery limit.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/4472
libcrypto3 and libssl3 in Alpine 3.18.0 have versions `3.1.0-r4`
which contains CVE-2023-2650:
https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2023-2650
Use ALpine image 3.18.2 which contains fixed versions of libssl3
and libcrypto3: 3.1.1-r0
NB: In Openshift these containers are marked as vulnerabilities
because of these CVEs.
Error message will be present for any auth error, but message claims an error is about OAuth2 configuration which is confusing.
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
The validation was needed for covering corner cases when storage is tested with data from 1970.
This resulted into unexpected search results, as year was parsed incorrectly from the given timestamp.
Co-authored-by: hagen1778 <roman@victoriametrics.com>
It is required for periods to be multiplies, but it was not stated clearly in documentation.
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
The change focuses on rectifying inconsistencies in the navigation behavior of the application
and eliminating issues encountered when manually altering the URL.
The key updates include:
- Refactoring of the routing mechanism to handle all possible routes and their states.
- Enhancement of the React Router usage to ensure a smoother navigation experience.
- Handling application state when the URL is manually changed.
Added a line for `probeSelector` and links to objects selectors section to make it easier to find more details about selectors.
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
expose `vmauth_user_request_duration_seconds`
and `vmauth_unauthorized_user_request_duration_seconds` summary metrics
for measuring requests latency per user.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
It is impossible to run OS vmauth with the provided config.
The example of using ip filters should be only a part of docs.
All other examples should work seamlessly with OS version.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* docs: add scroll to the selected element
* docs: scroll to root if element not found
* docs: simplify code
* docs: code cleanup
* docs: fix comments (fix code formatting, check element only inside sidebar container)
By default, vmalert will make multiple retry attempts with exponential delay.
The total time spent during retry attempts shouldn't exceed `-remoteWrite.retryMaxTime` (default is 30s).
When retry time is exceeded vmalert drops the data dedicated for `-remoteWrite.url`.
Before, vmalert dropped data after 5 retry attempts with 1s delay between attempts (not configurable).
See `-remoteWrite.retryMinInterval` and `-remoteWrite.retryMaxTime` cmd-line flags.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: Nikolay <nik@victoriametrics.com>
- remove second panel for disk usage. It is not very useful for users and brings more confusion than profit from having it.
- update CPU graph to show number of used CPUs to make it less ambiguous
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
This reverts commit c19048dc13.
Reason for revert: it has been appeared that the net/http.ResponseWriter is already buffered,
so there in no need in double bufferring
This simplifies routing at auth proxies such as vmauth to vlselect component,
which serves VMUI - just route all the requests, which start with /select/, to vlselect.
If the `Content-Type: application/json` request header isn't set,
then the server can improperly consume the request body when parsing request parameters
vmalert: retry all errors except 4XX status codes
Retry all errors except 4XX status codes while pushing via remote-write
to the remote storage. Previously, errors like broken connection could
prevent vmalert from retrying the request.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* fix: optimize the preparation of data for the graph
* fix: optimize tooltip rendering
* fix: optimize re-rendering of the chart
* vmui: memory leak fix
* lib/storage: creates parts.json on start-up if it not exists.
It fixes migrations from versions below v1.90.0.
Previously parts.json was created only after successful merge.
But if merge was interruped for some reason (OOM or shutdown), parts.json wasn't created and partitions left after interruped merge weren't properly deleted.
Since VM cannot check if it must be removed or not.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4336
* Apply suggestions from code review
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
* Update lib/storage/partition.go
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
---------
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
This error may be wrapped in another error, and should normally be tested using
`errors.Is(err, net.ErrClosed)`.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
app/vmagent/remotewrite: fix vmagent panic on shutdown
Currently, when vmagent is stopping it first flushes pending series in remote write context and proceeds to stop streaming aggregation. This leads to streaming aggregation being unable to write results into pending timeseries (since it is already nil) and panic.
This can lead to losing some aggregation results being lost almost silently.
The fix is reordering flow to first stop streaming aggregation and flush all pending time series after that.
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* deployment/docker: update VictoriaMetrics version from v1.91.1 to v1.91.2 in docker compose files
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* deployment/marketplace: update VictoriaMetrics version from v1.91.1 to v1.91.2 in marketplace files
Signed-off-by: hagen1778 <roman@victoriametrics.com>
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* app/vmctl: add verbose output for docker installations or when TTY isn't available
* app/vmctl: fix tests
* app/vmctl: make vmctl interactive if no tty
* app/vmctl: cleanup
* app/vmctl: add comment
---------
Co-authored-by: Nikolay <nik@victoriametrics.com>
* vmalert: fix nil map assignment
The storage instance with nil map params was created for remote-read purposes.
And before change 7a9ae9de0d this map was ignored in ApplyParams.
Now, it started to be used and vmalert panics in runtime.
The fix properly inits map for at `NewVMStorage` and verifies it is not nil
on assignment in `ApplyParams`.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmalert: add to changelog
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmalert: properly clone Storage params
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmalert: properly clone Storage params
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmalert: properly clone Storage params
Signed-off-by: hagen1778 <roman@victoriametrics.com>
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
at arm based CPUs only 9 digits after comma matches for tests.
Especially at holtWinters functions. Since it only takes effect at tests
it makes no sense for changing float prescision at actual functions
Previously the location inside the sendPrometheusError() was logged.
This could make hard investigating error locations via `vm_log_messages_total` metric.
This reverts the following commits:
- e0e16a2d36
- 2ce02a7fe6
The reason for revert: the updated logic breaks assumptions made
when fixing https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2698 .
For example, if a time series stop receiving new samples during the first
day after the indexdb rotation, there are chances that the time series
won't be registered in the new indexdb. This is OK until the next indexdb
rotation, since the time series is registered in the previous indexdb,
so it can be found during queries. But the time series will become invisible
for search after the next indexdb rotation, while its data is still there.
There is also incompletely solved issue with the increased CPU and disk IO resource
usage just after the indexdb rotation. There was an attempt to fix it, but it didn't fix
it in full, while introducing the issue mentioned above. See https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1401
TODO: to find out the solution, which simultaneously solves the following issues:
- increased memory usage for setups high churn rate and long retention (e.g. what the reverted commit does)
- increased CPU and disk IO usage during indexdb rotation ( https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1401 )
- https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2698
Possible solution - to create the new indexdb in one hour before the indexdb rotation
and to gradually pre-populate it with the needed index data during the last hour before indexdb rotation.
Then the new indexdb will contain all the needed data just after the rotation,
so it won't trigger increased CPU and disk IO.
- Document the change at docs/CHANGELOG.md
- Clarify comments for non-trivial code touched by the commit
- Improve the logic behind maybeCreateIndexes():
- Correctly create per-day indexes if the indexdb rotation is performed during
the first hour or the last hour of the day by UTC.
Previously there was a possibility of missing index entries on that day.
- Increase the duration for creating new indexes in the current indexdb for up to 22 hours
after indexdb rotation. This should reduce the increased resource usage
after indexdb rotation.
It is safe to postpone index creation for the current day until the last hour
of the current day after indexdb rotation by UTC, since the corresponding (date, ...)
entries exist in the previous indexdb.
- Search for TSID by (date, MetricName) in both the current and the previous indexdb.
Previously the search was performed only in the current indexdb. This could lead
to excess creation of per-day indexes for the current day just after indexdb rotation.
- Search for (date, metricID) entries in both the current and the previous indexdb.
Previously the search was performed only in the current indexdb. This could lead
to excess creation of per-day indexes for the current day just after indexdb rotation.
The new index substitutes global MetricName=>TSID index
used for locating TSIDs on ingestion path.
For installations with high ingestion and churn rate, global
MetricName=>TSID index can grow enormously making
index lookups too expensive. This also results into bigger
than expected cache growth for indexdb blocks.
New per-day index supposed to be much smaller and more efficient.
This should improve ingestion speed and reliability during
re-routings in cluster.
The negative outcome could be occupied disk size, since
per-day index is more expensive comparing to global index.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* added backup locking/unlocking against retention policy to vmbackupmanager
Signed-off-by: Alexander Marshalov <_@marshalov.org>
* added docs for new commands
Signed-off-by: Alexander Marshalov <_@marshalov.org>
* fix review comments
Signed-off-by: Alexander Marshalov <_@marshalov.org>
---------
Signed-off-by: Alexander Marshalov <_@marshalov.org>
* lib/storage: follow-up after a50d63c376
- ensure retentionMsecs is rounded to day
- remove localTimeOffset in test as localOffset is ignored when using `UnixMilli`
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* lib/storage: restore retention timezone offset effect on retention deadline
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
---------
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* feat: improvement of the top queries page
* vmui/docs: enhancements to top queries page
* Apply suggestions from code review
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
vmui: change default font size to 14px for better readability
vmui: fix bug with missing text on buttons in safari
---------
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
* app/vmui: added Labels with the highest number of unique values
* app/vmui: cleanup
* app/vmui: cleanup
* app/vmui: add table description
* app/vmui: fix comment, updated CHANGELOG.md
* app/vmui: disable links
* app/vmui: added actions to the table, it will show values for selected label with the highest number of series
* app/vmui: fix comment
Previously, metric `vmalert_alerting_rules_last_evaluation_series_fetched`
would be set to 0 for const expressions, because const expression do not match
any series. This may result into a confusion: no series were matched but response isn't empty.
The change updates the logic behind metric: if no series were matched but there are samples
in response - use amount of samples as number of series.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmalert: expand rule groups on anchor click
before, anchor click was only updating the URL.
To expand the group, user had to click on rule's block.
Now, group will toggle automatically.
* vmalert: allow filtering group in web UI
The new filter allows to filter groups and rules within
groups by: errors only or noMatch only.
The filtering supposed to help navigating big numbers of groups/rules.
Filtering is reflected in URL, so can be shared as a link.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Without reset, labels duplicates could have been added during stream aggregation.
Since `ctx.Labels` is reused during processing of many series, each series will
add its labels to the context. Even if the same labels were already addeded on prev
iteration. Now, we reset `ctx.Labels` on each iteration to contain so labels from
different series didn't interfere.
This could have cause exceeding of the limit on number of labels per pushed time series.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4277
Signed-off-by: hagen1778 <roman@victoriametrics.com>
This reverts commit 9e99f2f5b3.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4068
Reason for revert: this breaks valid use cases:
- If timestamps aren't specified in the incoming samples on purpose. For example, if stream aggregation is used
as StatsD replacement. StatsD protocol has no timestamp concept for incoming samples.
See https://github.com/b/statsd_spec
- If all the samples must be aggregated, even if they contain stale timestamps.
for example, if the stream aggregation produces some counter of some events,
it may be better to count all the events even if they were delayed before
being ingested into VictoriaMetrics.
Is is also unclear how to determine whether the sample becomes stale.
For example, if the aggregation interval equals to 1h, and the previous
aggregation cycle just finished 10 minutes ago, what to do with the newly
incoming sample with the timestamp 30 minutes older than the current time?
The answer highly depends on the context, so it is unsafe to uncoditionally
use a single logic for dropping the old samples here.
app/vmalert: detect alerting rules which don't match any series at all
vmalert starts to understand /query responses which contain object:
```
"stats":{"seriesFetched": "42"}
```
If object is present, vmalert parses it and populates a new field
`SeriesFetched`. This field is then used to populate the new metric
`vmalert_alerting_rules_last_evaluation_series_fetched` and to
display warnings in the vmalert's UI.
If response doesn't contain the new object (Prometheus or
VictoriaMetrics earlier than v1.90), then `SeriesFetched=nil`.
In this case, UI will contain no additional warnings.
And `vmalert_alerting_rules_last_evaluation_series_fetched` will
be set to `-1`. Negative value of the metric will help to compile
correct alerting rule in follow-up.
Thanks for the initial implementation to @Haleygo
See https://github.com/VictoriaMetrics/VictoriaMetrics/pull/4056
See https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4039
Signed-off-by: hagen1778 <roman@victoriametrics.com>
It makes it easier for users who build and self-host images to publish their images without changing tags manually.
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
It appears that 90% usage for anonymous mem usage
is already concerning. So we lowering the threshold to 80%.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
previously during sync for mutable and immutable cache parts, link for hotEntry with current date may be not properly updated
it corrupts cache for backfilling metrics and increased cpu load
Windows doesn't allow to remove dir with opened files. Usually it's a case for snapshots, hard cannot be removed if file is openned.
With this change, dir will be renamed and properly deleted at the next process start.
It's recommended to restart vmstorage/vmsingle for snapshots deletion completion periodically.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/70
* vmselect: exit early from queue on context cancel
When `-search.maxConcurrentRequests` is reached, vmselect puts
request in the queue. It is expected, that requests in the queue
will be processed as soon as it would be enough capacity to do so.
However, it could happen that while request was waiting its turn,
the client could have already cancel it (close the connection,
or just close the tab with UI). In this case, we should de-queue
such requests to avoid spending extra resources on them.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* app/vmselect: address review comments
Signed-off-by: hagen1778 <roman@victoriametrics.com>
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* lib/promscrape/discovery/kubernetes: add common labels to all ports discovered from endpoints
Sets
`__meta_kubernetes_endpoints_name` and `__meta_kubernetes_namespace` labels to all ports of pod.
Prometheus sets those labels to all ports in pod (0ab9553611/discovery/kubernetes/endpoints.go (L267C15-L269)) even if port is not matching any service.
See: #4154
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* lib/promscrape/discovery/kubernetes: fix test for updated discovery logic
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
---------
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* lib/streamaggr: discard samples with timestamps not matching aggregation interval
Samples with timestamps lower than `now - aggregation_interval` are likely to be written via backfilling and should not be used for calculation of aggregation.
See #4068
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* lib/streamaggr: make log message more descriptive, fix imports
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
---------
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
Templating of `-external.alert.source` is not expected to have access to the query which was causing runtime error when query function was passed as nil.
See: #4181
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* app/vmagent,lib/persistentqueue: show warning message if `--remoteWrite.maxDiskUsagePerURL` flag lower than 500MB
* app/vmagent,lib/persistentqueue: linter fix
* app/vmagent,lib/persistentqueue: fix comment
* feat: display heatmap in the explore metrics (#4111)
* fix: correct calc step for heatmap
* fix: remove spaces in the result of getDurationFromMilliseconds
* feat: add button "show today" to date picker
* feat: add comparison with the prev day (#3967)
* vmui/docs: add comparison of data to cardinality page
* feat: add WithTemplate page
* app/vmselect/prometheus: enable json mode for expand with expr API
* app/vmselect/prometheus: enable CORS and add content type
* feat: add api for expand with templates
* fix: remove console from useExpandWithExprs
* app/vmselect/prometheus: fix escaping
* vmui: integrate WITH template
* app/vmctl: check content type instead of form param
* fix: add content-type for fetch with-exprs
* fix: add a header to the server's response that allows the "Content-Type" header
* app/vmctl: added comment and cleanup
* app/vmctl: use format query param
---------
Co-authored-by: dmitryk-dk <kozlovdmitriyy@gmail.com>
* app/vmctl: add support for the different time format in the native binary protocol
* app/vmctl: update flag description, update CHANGELOG.md
* app/vmctl: add comment to exported function
* lib/httpserver: introduce `-http.maxConcurrentRequests` command-line flag
Introduce `-http.maxConcurrentRequests` command-line flag to protect
VM components from resource exhaustion during unexpected spikes of HTTP requests.
By default, the new flag's value is set to 0 which means no limits are applied.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* lib/httpserver: mention http.maxConcurrentRequests in docs
Signed-off-by: hagen1778 <roman@victoriametrics.com>
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
- added info about metric `vm_vminsert_metrics_read_total`,
- small doc refactoring
- and added make-command for running docs in docker.
Signed-off-by: Alexander Marshalov <_@marshalov.org>
* vmalert: retry datasource requests with EOF or unexpected EOF errors
Retry failed read request on the closed connection one more time.
This may improve rules execution reliability when connection
between vmalert and datasource closes unexpectedly.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmalert: fix old tests
Signed-off-by: hagen1778 <roman@victoriametrics.com>
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
This handler will instruct search engines that indexing is not allowed for the content exposed to the internet. This should help to address issues like #4128 when instances are exposed to the internet without authentication.
"How do We Keep Metrics for a Long Time in VictoriaMetrics" article is referenced twice in "Third-party articles and slides about VictoriaMetrics" section
Callers of OpenStorage() log the returned error and exit.
The error logging and exit can be performed inside MustOpenStorage()
alongside with printing the stack trace for better debuggability.
This simplifies the code at caller side.
Use fs.MustReadDir() instead of os.ReadDir() across the code in order to reduce the code verbosity.
The fs.MustReadDir() logs the error with the directory name and the call stack on error
before exit. This information should be enough for debugging the cause of the error.
Callers of CreateFlockFile log the returned err and exit.
It is better to log the error inside the MustCreateFlockFile together with the path
to the specified directory and the call stack. This simplifies
the code at the callers' side while leaving the debuggability at the same level.
Callers of InitFromFilePart log the error and exit.
It is better to log the error with the path to the part and the call stack
directly inside the MustInitFromFilePart() function.
This simplifies the code at callers' side while leaving the same level of debuggability.
Callers of this function log the returned error and exit.
It is better logging the error together with the path to the filename
and call stack directly inside the function. This simplifies
the code at callers' side without reducing the level of debuggability
Callers of this function log the returned error and exit.
Let's log the error with the path to the filename and call stack
inside the function. This simplifies the code at callers' side
without reducing the level of debuggability.
- Add 'BUG:' prefix to error messages related to programming errors aka bugs.
- Consistently log the path to the file in all the messages in order to improve debuggability.
Callers of ReadFullData() log the error and then exit.
So let's log the error with the path to the filename and the call stack
inside MustReadData(). This simplifies the code at callers' side,
while leaving the debuggability at the same level.
Callers of these functions log the returned error and then exit.
Let's log the error with the call stack inside the function itself.
This simplifies the code at callers' side, while leaving the same
level of debuggability in case of errors.
Callers of this function log the returned error and then exit.
Let's log the error with the call stack inside the function itself.
This simplifies the code at callers' side, while leaving the same
level of debuggability in case of errors.
Callers of this function log the returned error and then exit.
Let's log the error with the call stack inside the function itself.
This simplifies the code at callers' side, while leaving the same
level of debuggability in case of errors.
Callers of this function log the returned error and exit.
So let's just log the error with the given filepath and the call stack
inside the function itself and then exit. This simplifies the code
at callers' place while leaves the same level of debuggability in case of errors.
Callers of these functions log the returned error and then exit. The returned error already contains the path
to directory, which was failed to be created. So let's just log the error together with the call stack
inside these functions. This leaves the debuggability of the returned error at the same level
while allows simplifying the code at callers' side.
While at it, properly use MustMkdirFailIfExist instead of MustMkdirIfNotExist inside inmemoryPart.MustStoreToDisk().
It is expected that the inmemoryPart.MustStoreToDick() must fail if there is already a directory under the given path.
When WriteFileAndSync fails, then the caller eventually logs the error message
and exits. The error message returned by WriteFileAndSync already contains the path
to the file, which couldn't be created. This information alongside the call stack
is enough for debugging the issue. So just use log.Panicf("FATAL: ...") inside MustWriteAndSync().
This simplifies error handling at caller side a bit.
This is a follow-up after 42bba64aa7
Previously the part directory listing was fsync'ed implicitly inside partHeader.WriteMetadata()
by calling fs.WriteFileAtomically(). Now it must be fsync'ed explicitly.
There is no need in fsync'ing the parent directory, since it is fsync'ed by the caller
when updating parts.json file.
Previously the created part directory listing was fsynced implicitly
when storing metadata.json file in it.
Also remove superflouous fsync for part directory listing,
which was called at blockStreamWriter.MustClose().
After that the metadata.json file is created, so an additional fsync
for the directory contents is needed.
Improperly configured -bigMergeConcurrency command-line flag usually leads to uncontrolled
growth of unmerged parts, which, in turn, increases CPU usage and query durations.
So it is better deprecating this flag. In rare cases -smallMergeConcurrency command-line flag
can be used instead for controlling the concurrency of background merges.
This makes it easier to understand exact point in time which is included in this backup.
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* fix: correct display of errors for query
* fix: change the logic of histogram detection
* feat: hide empty buckets from the graph
* fix: revert server url
The logic employed for re-using the previously loaded scrape target was broken initially.
The commit cc0427897c tried to fix it, but the new logic
became too complex and fragile. So it is better to just remove this logic,
since the targets from temporarily broken file should be eventually loaded on next
attempts every -promscrape.fileSDCheckInterval
This also allows removing fragile hacks around __vm_filepath label.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3989
* lib/promscrape: fix the problem with scrape work duplicates when file_sd_config can't be read
* lib/promscrape: clarified comment
* lib/promscrape: made better approach to handle a problem with growing []*ScrapeWork on each error when loading config
* lib/promscrape: added CHANGELOG.md
* Update docs/CHANGELOG.md
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* feat: add tips for working with the graph and legend
* feat: add the ability to collapse the legend
* vmui/docs: add the ability to collapse the legend
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* lib/storage: check for free disk space before opening tables
We check for free disk space before call to `openTable`,
so `Storage` can be set to ReadOnly before mergeWorkers start.
Before the change, there was a chance that merges will start
even if Storage has to start in ReadOnly mode because of
`-storage.minFreeDiskSpaceBytes` limit.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4023
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* lib/storage: chore
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* Update lib/storage/storage.go
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
- Make sure that the last successfully loaded config is used on hot-reload failure
- Properly cleanup resources occupied by already initialized aggregators
when the current aggregator fails to be initialized
- Expose distinct vmagent_streamaggr_config_reload* metrics per each -remoteWrite.streamAggr.config
This should simplify monitoring and debugging failed reloads
- Remove race condition at app/vminsert/common.MustStopStreamAggr when calling sa.MustStop() while sa
could be in use at realoadSaConfig()
- Remove lib/streamaggr.aggregator.hasState global variable, since it may negatively impact scalability
on system with big number of CPU cores at hasState.Store(true) call inside aggregator.Push().
- Remove fine-grained aggregator reload - reload all the aggregators on config change instead.
This simplifies the code a bit. The fine-grained aggregator reload may be returned back
if there will be demand from real users for it.
- Check -relabelConfig and -streamAggr.config files when single-node VictoriaMetrics runs with -dryRun flag
- Return back accidentally removed changelog for v1.87.4 at docs/CHANGELOG.md
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3639
Verifying status code helps to avoid misleading errors caused by attempt to parse unsuccessful response.
Related issue: #4034
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
- Compare directory names instead of paths to directory when determining which persistent queues must be deleted
This is less error-prone solution, since paths to the same directory can differ, which could lead
to accidental directory removal for the existing -remoteWrite.url
- Log the `removed %d dangling queues` message when at least a single queue has been removed
- Consistently use filepath.Join() for creating paths to persistent queues.
This is needed for Windows support (see https://github.com/VictoriaMetrics/VictoriaMetrics/issues/70 )
- Clarify the description of the change at docs/CHANGELOG.md
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/4014
There is a bug here where if you have a single bucket like:
foo{vmrange="4.084e+02...4.642e+02"} 2 123
The expected output is three le encoded buckets like:
foo{le="4.084e+02"} 0 123
foo{le="4.642e+02"} 2 123
foo{le="+Inf"} 2 123
This correctly encodes the start and end of the vmrange.
If however, the input contains the previous bucket, and that bucket is
empty then you only get the end le and +Inf out currently, i.e:
foo{vmrange="7.743e+05...8.799e+05"} 5 123
foo{vmrange="6.813e+05...7.743e+05"} 0 123
results in:
foo{le="8.799e+05"} 5 123
foo{le="+Inf"} 5 123
This causes issues when you go to compute a quantile because this means
that the assumed lower bound of the buckets is 0 and this we interpolate
between 0->end rather than the vmrange start->end as expected.
- Expose stats.seriesFetched at `/api/v1/query_range` responses too
for the sake of consistency.
- Initialize QueryStats when it is needed and pass it to EvalConfig then.
This guarantees that the QueryStats is properly collected when the query
contains some subqueries.
The change adds a new field `seriesFetched` to EvalConfig object.
Since EvalConfig object can be copied inside `Exec`,
`seriesFetched` is a pointer which can be updated by all copied
objects.
The reason for having stats is that other components, like vmalert,
could benefit from this information.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
using `runtime.Gosched` requires acquiring global lock to check if there are any other goroutines to perform tasks. with the latest versions of runtime it can pause running goroutines automatically without requiring to call `Gosched` directly.
Updates #3966
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
- Use windows.FlushFileBuffers() instead of windows.Fsync() at streamTracker.adviseDontNeed()
for consistency with implementations for other architectures.
- Use filepath.Base() instead of filepath.Split(), since the dir part isn't used.
This simplifies the code a bit.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/70
This is a follow-up for 43b24164ef
* lib/fs: adds memory map for windows
it should improve performance for file reading
* lib/storage: replace '/' with os specific separator
it must fix an errors for windows
* lib/fs: mention windows fsync support
* lib/filestream: adds fdatasync for windows writes
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/70
- Allocate and initialize seriesByWorkerID slice in a single go instead
of initializing every item in the list separately.
This should reduce CPU usage a bit.
- Properly set anti-false sharing padding at timeseriesWithPadding structure
- Document the change at docs/CHANGELOG.md
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3966
* vmselect/promql: refactor `evalRollupNoIncrementalAggregate` to use lock-less approach for parallel workers computation
Locking there is causing issues when running on highly multi-core system as it introduces lock contention during results merge.
New implementation uses lock less approach to store results per workerID and merges final result in the end, this is expected to significantly reduce lock contention and CPU usage for systems with high number of cores.
Related: #3966
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* vmselect/promql: add pooling for `timeseriesWithPadding` to reduce allocations
Related: #3966
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* vmselect/promql: refactor `evalRollupFuncWithSubquery` to avoid using locks
Uses same approach as `evalRollupNoIncrementalAggregate` to remove locking between workers and reduce lock contention.
Related: #3966
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
---------
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* allowed using dashes and dots in environment variables names for templating config files with envtemplate (#3999)
Signed-off-by: Alexander Marshalov <_@marshalov.org>
* Apply suggestions from code review
---------
Signed-off-by: Alexander Marshalov <_@marshalov.org>
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
Setting up VMAnomaly on NodeExporter metrics with VictoriaMetrics and AlertManager.
* vmanomaly-guide-draft
* aletr graphs and description
* readme vmanomaly tutorial
* Added back fit_every param for performance
* vmanomaly guide fixes
* added spaces div
* spaces + resize image
* alert example grammar
* quotation marks
* docker link
* typo fixed
* more links
* reader section rephrased
* label change
* lower case for grafana service
* lower case for vm service
* yaml markdown
---------
Co-authored-by: Dima Lazerka <dima@victoriametrics.com>
* lib/netutil: log only parsing errors for proxy-protocol
Previosly every error was logged. With configured TCP health checks at load-balancer or kubernetes, vmauth spams a lot of false positive error message into logs
* Update docs/CHANGELOG.md
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
* Update lib/netutil/tcplistener.go
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
This opens the possibility to remove tssLock from evalRollupFuncWithSubquery()
in the follow-up commit from @zekker6 in order to speed up the code
for systems with many CPU cores.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3966
Call runtime.Gosched() only when there is a work to steal from other workers.
Simplify the timeseriesWorker() and unpackWroker() code a bit by inlining stealTimeseriesWork() and stealUnpackWork().
This should reduce CPU usage when processing queries on systems with big number of CPU cores.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3966
* vmalert: support logs suppressing during config reloads
The change is mostly required for ENT version of vmalert,
since it supports object-storage for config files.
Reading data from object storage could be time-consuming,
so vmalert emits logs to track the progress.
However, these logs are mostly needed on start or on
manual config reload. Printing these logs each time
`rule.configCheckInterval` is triggered would too verbose.
So the change allows to control logs emitting during
config reloads.
Now, logs are emitted during start up or when SIGHUP is receieved.
For periodicall config checks logs emitted by config pkg are suppressed.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmalert: review fixes
Signed-off-by: hagen1778 <roman@victoriametrics.com>
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
This commit changes background merge algorithm, so it becomes compatible with Windows file semantics.
The previous algorithm for background merge:
1. Merge source parts into a destination part inside tmp directory.
2. Create a file in txn directory with instructions on how to atomically
swap source parts with the destination part.
3. Perform instructions from the file.
4. Delete the file with instructions.
This algorithm guarantees that either source parts or destination part
is visible in the partition after unclean shutdown at any step above,
since the remaining files with instructions is replayed on the next restart,
after that the remaining contents of the tmp directory is deleted.
Unfortunately this algorithm doesn't work under Windows because
it disallows removing and moving files, which are in use.
So the new algorithm for background merge has been implemented:
1. Merge source parts into a destination part inside the partition directory itself.
E.g. now the partition directory may contain both complete and incomplete parts.
2. Atomically update the parts.json file with the new list of parts after the merge,
e.g. remove the source parts from the list and add the destination part to the list
before storing it to parts.json file.
3. Remove the source parts from disk when they are no longer used.
This algorithm guarantees that either source parts or destination part
is visible in the partition after unclean shutdown at any step above,
since incomplete partitions from step 1 or old source parts from step 3 are removed
on the next startup by inspecting parts.json file.
This algorithm should work under Windows, since it doesn't remove or move files in use.
This algorithm has also the following benefits:
- It should work better for NFS.
- It fits object storage semantics.
The new algorithm changes data storage format, so it is impossible to downgrade
to the previous versions of VictoriaMetrics after upgrading to this algorithm.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3236
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3821
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/70
* app/vmagent: allow vm proto for kafka consumer and producer
it should reduce network usage up to 50%.
According to benchmarks without any encoding at kafka topic, it reduces traffic up to 50%.
With enabled zstd at kafka topic, it shows no diffence in traffic. So it
doesn't make much sense to use it.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1225
* mention eb61a7dd68b834b08d01727a918f207700348ada at changelog
* app/vmagent: bumps kafka lib version
it allows compiling vmagent for arm64 machines
fixes https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2271
* mention d19b1a888248c96cfd7ccee00ba6f596d89be1d7 at change log
* app/vmagent: adds natural concurrency for kafka consumer
it should improve performance for data consumption
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1957
* mention change 0c143bb22ca2e7e0b7eec9bc84a94ee2b41626ca
* Update app/vmagent/kafka/consumer.go
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
* Update app/vmagent/kafka/consumer_cgo.go
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
* vmalert: support concurrent reading from object storage
Config reading from GCS or S3 can be slow if object storage
contains a big number of files. Object storages are usually
fast for downloading and are slow for individual operations.
If there would be thousands of files to read, vmalert could
spend significant time for retrieving those because it is
done sequentially.
The change introduces ability to read configs from object
storage concurrently. By default, both GCS and S3 are now
read with 50 concurrent readers. This significantly reduces
the load time:
* loading 500 files with concurrency=1 takes 27s
* loading 500 files with concurrency=50 takes <1s
* vmalert: add note to Changelog
* vmalert: cleanup
* vmalert: use ticker properly
* app/vmalert: improve status reporting during config loading
* vmalert: support concurrent reading from object storage
Config reading from GCS or S3 can be slow if object storage
contains a big number of files. Object storages are usually
fast for downloading and are slow for individual operations.
If there would be thousands of files to read, vmalert could
spend significant time for retrieving those because it is
done sequentially.
The change introduces ability to read configs from object
storage concurrently. By default, both GCS and S3 are now
read with 50 concurrent readers. This significantly reduces
the load time:
* loading 500 files with concurrency=1 takes 27s
* loading 500 files with concurrency=50 takes <1s
* app/vmalert: make linter happy
* dashboards/cluser: use `quantile` since `median` isn't supported by PromQL
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* dashboards/*: add `restarts` annotation to show when there were restarts
The cluster's annotation query is aggregated `by job`,
while vmagent/vmalert are aggregated `by job, instance`.
This is because cluster dashboard can contains too many instances
and annotation could become too noisy.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* dashboards/*: support instance filter in Version annotation
Signed-off-by: hagen1778 <roman@victoriametrics.com>
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
The change also introduces `List` method to `FS` interface.
The `List` method can be used for wildcard support in object storage FS.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: Nikolay <nik@victoriametrics.com>
- Sync the description for -httpListenAddr.useProxyProtocol command-line flag at vmagent and vmauth,
so it is consistent with the description at vmauth and victoria-metrics
- Add a sample of panic text to docs/CHANGELOG.md, so it could be googled
- Mention the -httpListenAddr.useProxyProtocol command-line flag in the description for the bugfix
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3335
Each group in vmalert starts with an artifical delay to avoid
thundering herd problem. For some groups with high evaluation
intervals, the delay could be significant.
If during this delay user will remove the group from the config
and hot-reload it - vmalert will have to wait until the delay
ends. This results into slow config reloading and UI hang.
The change moves the start-delay logic back to the group's
`start` method. Now, group can immediately exit from the
delay when `group.close()` method is called.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
lib{mergset,storage}: prevent possible race condition with logging stats for merges
Previously partwrapper could be release by background process and reference for part may be invalid
during logging stats. It will lead to panic at vmstorage
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3897
app/vmctl: vm-native - split migration on per-metric basis
`vm-native` mode now splits the migration process on per-metric basis.
This allows to migrate metrics one-by-one according to the specified filter.
This change allows to retry export/import requests for a specific metric and provides a better
understanding of the migration progress.
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: hagen1778 <roman@victoriametrics.com>
When group's update() or close() method is called, the group
still need to wait for its current evaluation to finish.
Sometimes, evaluation could take a significant amount of time
which slows configuration update or vmalert's graceful shutdown.
The change interrupts current evaluation in order to speed up
the graceful shutdown or config update procedures.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
- Use flag.Duration instead of flagutil.Duration for -snapshotCreateTimeout,
since the flagutil.Duration is intended mostly for big durations, e.g. days, months and years,
while the -snapshotCreateTimeout is usually smaller than one hour.
- Add links to https://docs.victoriametrics.com/#how-to-work-with-snapshots in docs/CHANGELOG.md,
so readers could easily find the corresponding docs when reading the changelog.
- Properly remove all the created directories on unsuccessful attempt to create
snapshot in Storage.CreateSnapshot().
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3551
* lib/{fs,mergeset,storage}: skip `.must-remove.` dirs when creating snapshot (#3858)
* lib/{mergeset,storage}: add timeout configuration for snapshots creation, remove incomplete snapshots from storage
* docs: fix formatting
* app/vmstorage: add metrics to track status of snapshots
* app/vmstorage: use `vm_http_requests_total` metric for snapshot endpoints metrics, rename new flag to make name more clear
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* app/vmstorage: update flag name in docs
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* app/vmstorage: reflect new metrics names change in docs
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
---------
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* lib/promscrape: set `vm_promscrape_config_last_reload_successful` to 1 if there was no promscrape config provided
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* lib/promscrape: register `vm_promscrape_config_*` metrics only in case promscrape config is used
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
---------
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
Added `victoria-metrics-linux-s390x` to allow single node builds for `s390x` platform.
Leaving other packaging options at the moment, as on this platform, they're mostly going to be built from/with container images hosted within the company as a base, and not alpine.
* vmselect/promql: check for deadline in `count_values` fn
`count_values` could be very slow during the data processing.
Checking for deadline between iterations supposed to reduce
probability of exceeding `search.maxQueryDuration`.
The change also adds a new trace record, which captures the time
spent in aggregation function. Before that, the trace for aggr funcs
could be confusing since it doesn't account for all the places where
time was spent.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* wip
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* metricsql: support optional 2nd argument for rollup functions
Support optional 2nd argument `min`, `max` or `avg` for rollup functions:
* rollup
* rollup_delta
* rollup_deriv
* rollup_increase
* rollup_rate
* rollup_scrape_interval
If second argument is passed, then rollup function will return only the selected aggregation type.
This change can be useful for situations where only one type of rollup calculation is needed.
For example, `rollup_rate(requests_total[5m], "max")`.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* wip
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
- Return immediately on context cancel during the backoff sleep.
This should help with https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3747
- Add a comment describing why the second attempt to obtain the response from remote side
is perfromed immediately after the first attempt.
- Remove fasthttp dependency from lib/promscrape/discoveryutils
- Set context deadline before calling doRequestWithPossibleRetry().
This simplifies the doRequestWithPossibleRetry() a bit.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3293
* fix: do not use exponential backoff for first retry of scrape request (#3293)
* lib/promscrape: refactor `doRequestWithPossibleRetry` backoff to simplify logic
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
* Update lib/promscrape/client.go
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
* lib/promscrape: refactor `doRequestWithPossibleRetry` to make it more straightforward
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
---------
Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
* feat: improve mobile ui
* feat: improve mobile ui
* fix: change style server url
* fix: improve ExploreMetrics mobile
* fix: display global settings on all pages
* update helper scripts to latest versions
* added missed command for initialisation variables from Makefile
* deployment/marketplace/vultr/helper-scripts/vultr-helper.sh: update helper script to latest version
* fixed typo for using VM_VERSION variable
* added an example of specifying the VM_VERSION and tokens for API's
* set packer logging to STDOUT by default
* Modify API version when running in Container App
* Handle expires on from token response
Response from IMDS does not always contain expires in value which is
currently used to get the token expiry time. An example resources that
doesn't provide it are Container Apps and App Service.
Signed-off-by: Mattias Ängehov <mattias.angehov@castoredc.com>
* Fix client id parameter for user assigned identity
* Apply suggestions from code review
---------
Signed-off-by: Mattias Ängehov <mattias.angehov@castoredc.com>
Co-authored-by: Aliaksandr Valialkin <valyala@gmail.com>
- Do not generate __meta_server label, since it is unavailable in Prometheus.
- Add a link to https://docs.victoriametrics.com/sd_configs.html#kuma_sd_configs to docs/CHANGELOG.md,
so users could click it and read the docs without the need to search the corresponding docs.
- Remove kumaTarget struct, since it is easier generating labels for discovered targets
directly from the response returned by Kuma. This simplifies the code.
- Store the generated labels for discovered targets inside atomic.Value. This allows reading them
from concurrent goroutines without the need to use mutex.
- Use synchronouse requests to Kuma instead of long polling, since there is a little sense
in the long polling when the Kuma server may return 304 Not Modified response every -promscrape.kumaSDCheckInterval.
- Remove -promscrape.kuma.waitTime command-line flag, since it is no longer needed when long polling isn't used.
- Set default value for -promscrape.kumaSDCheckInterval to 30s in order to be consistent with Prometheus.
- Remove unnecessary indirections for string literals, which are used only once, in order to improve code readability.
- Remove unused fields from discoveryRequest and discoveryResponse.
- Update tests.
- Document why fetch_timeout and refresh_interval options are missing in kuma_sd_config.
- Add docs to discoveryutils.RequestCallback and discoveryutils.ResponseCallback,
since these are public types.
Side notes: it is weird that Prometheus implementation for kuma_sd_configs sets `instance` label,
since usually this label is set by the Prometheus itself to __address__ after the relabeling phase.
See https://www.robustperception.io/life-of-a-label/
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3389
See https://github.com/prometheus/prometheus/issues/7919
and https://github.com/prometheus/prometheus/pull/8844
as a reference implementation in Prometheus
`lib/protoparser/prometheus` is used by various applications,
such as `app/vmalert`. The recent change to the
`lib/protoparser/prometheus` package introduced a new dependency
of `lib/writeconcurrencylimiter` which exposes some metrics.
Because of the dependency, now all applications which have this
dependency also expose these metrics.
Creating a new `lib/protoparser/prometheus/stream` package helps
to remove these metrics from apps which use `lib/protoparser/prometheus`
as dependency.
See https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3761
Signed-off-by: hagen1778 <roman@victoriametrics.com>
`avg` can be affected by just one outlier, which may lead
to false conclusions. `median` is supposed to reflect
reality better by leveling outliers out.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
While at it, stop sending requests to unavailable backend for 3 seconds
before the next attempt. This should reduce the amounts of useless work
and the number of useless network packets when the backend is temporarily unavailable.
* app/vmauth: add concurent requests limit per auth record
* app/vmauth: added clarification comment
* app/vmauth: remove unused code
* app/vmauth: move read from limiter
* app/vmauth: fix text
* app/vmauth: fix comments
* - Clarify the docs for the max_concurrent_requests option at docs/vmauth.md
- Clarify the description of the change at docs/CHANGELOG.md
- Make sure that the -maxConcurrentRequests takes precedence over per-user max_concurrent_requests
- Update tests for verifying that the max_concurrent_requests option is parsed properly
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3346
---------
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
- Document the change at docs/CHANGELOG.md
- Add `Reading rules from object storage` section to docs/vmalert.md
- Add `s3` prefix to command-line flags related to the configuration of s3 and gcs clients
- Explicitly mention that reading rules from object storage is supported only in enterprise version
* vmalert: support object storage for rules
Support loading of alerting and recording rules from object
storages `gcs://`, `gs://`, `s3://`.
* review fixes
* vmalert: use group's ID in UI to avoid collisions
Identical group names are allowed. So we should used IDs
for various groupings and aggregations in UI.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmalert: prevent disabling state updates tracking
The minimum number of update states to track is now set to 1.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmalert: properly update `debug` and `update_entries_limit` params on hot-reload
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmalert: display `debug` field for rule in UI
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmalert: exclude `updates` field from json marhsaling
This field isn't correctly marshaled right now.
And implementing the correct marshaling for it doesn't
seem right, since json representation is mostly used
by systems like Grafana. And Grafana doesn't expect this
field to be present.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* fix test for disabled state
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* fix test for disabled state
Signed-off-by: hagen1778 <roman@victoriametrics.com>
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
docs: clarifications between standalone/cluster ingestion endpoints
This is an attempt to make it a bit clearer to the user that the cluster version ingestion URLs are different from the standalone ones. I have also changed the order of the list items to make it a bit clearer and hopefully stop the user simply inferring that `/prometheus/api/v1` is only related to Prometheus data.
* vmalert: speed up state restore procedure on start
Alerts state restore procedure has been changed to become asynchronous.
It doesn't block groups start anymore which significantly improves vmalert's startup time.
Instead, state restore is called by each group in their goroutines after the first rules
evaluation.
While previously state restore attempt was made for all loaded alerting rules,
now it is called only for alerts which became active after the first evaluation.
This reduces the amount of API calls to the configured remote read URL.
This also means that `remoteRead.ignoreRestoreErrors` command-line flag becomes deprecated now
and will have no effect if configured.
See relevant issue https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2608
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* make lint happy
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* Apply suggestions from code review
---------
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
previously historical data backfilling may trigger force merge for previous month every hour
it consumes cpu, disk io and decrease cluster performance.
Following commit fixes it by applying deduplication for InMemoryParts
The limit has been increased from 300 bytes to 500 bytes according to the collected production stats.
This allows reducing CPU usage without significant increase of RAM usage in most practical cases.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3692
This allows better controlling requests to backends and providing better error logging.
For example, if the backend was unavailable, then the ReverseProxy was logging the error
message without client ip and the initial request uri. This could harden debugging.
This is based on https://github.com/VictoriaMetrics/VictoriaMetrics/pull/3486
The docs/assets folder should be used only for assets specific to docs generation at https://docs.victoriametrics.com, e.g. css, js and images.
All the other assets related to specific docs should be placed in the same folder as the corresponding *.md file.
These assets should have the same name prefix as the corresponding doc file name. This simplifies tracking the lifetime of these assets.
For example, if the doc is removed, it is very easy to remove all assets associated with it with a simple `rm -rf docs/doc-name*` command.
This also simplifies generating correct urls for doc-specific assets from both https://docs.victoriametrics.com
and from https://github.com/VictoriaMetrics/VictoriaMetrics/blob/master/docs/ - just refer to the asset name without any directory prefixes.
* feat: include fonts in the build
* fix: reduce size fonts
* wip
- Document the change at docs/CHANGELOG.md
- Run `make vmui-update`
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
Allow users fine-tuning the maximum string length for interning via -internStringMaxLen command-line flag.
This may be used for fine-tuning RAM vs CPU usage for certain workloads.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3692
- Call httpserver.GetQuotedRemoteAddr() and httpserver.GetRequestURI() only when the error occurs.
This saves CPU time on fast path when there are no parsing errors.
- Create a helper function - httpserver.LogError() - for logging the error with the request uri and remote addr context.
Stale series are sent when there is a difference between current
and previous scrapes. Those series which disappeared in the current scrape
are marked as stale and sent to the remote storage.
Sending stale series requires memory allocation and in case when too many
series disappear in the same it could result in noticeable memory spike.
For example, re-deploy of a big fleet of service can result into
excessive memory usage for vmagent, because all the series with old
pod name will be marked as stale and sent to the remote write storage.
This change limits the number of stale series which can be sent at once,
so memory usage remains steady.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3668https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3675
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Signed-off-by: hagen1778 <roman@victoriametrics.com>
This prevents vmbackup from leaking passwords into logs like shown below.
2023-01-11T15:00:01.050Z info VictoriaMetrics/lib/logger/flag.go:12 build version: vmbackup-20221214-211706-tags-v1.85.1-0-g09a70d3e9
2023-01-11T15:00:01.050Z info VictoriaMetrics/lib/logger/flag.go:13 command-line flags
2023-01-11T15:00:01.050Z info VictoriaMetrics/lib/logger/flag.go:20 -dst="fs:///vm-backups/latest"
2023-01-11T15:00:01.050Z info VictoriaMetrics/lib/logger/flag.go:20 -snapshot.createURL="http://user:super_sercret123@victoriametricspshot/create"
2023-01-11T15:00:01.050Z info VictoriaMetrics/lib/logger/flag.go:20 -storageDataPath="/storage"
2023-01-11T15:00:01.050Z info VictoriaMetrics/app/vmbackup/main.go:53 Snapshot create url http://user:super_sercret123@victoriametrics:8428/snapshot/create
2023-01-11T15:00:01.050Z info VictoriaMetrics/app/vmbackup/main.go:60 Snapshot delete url http://user:super_sercret123@victoriametrics:8428/snapshot/delete
Assisted merges are intended to be performed by goroutines, which accept the incoming samples,
in order to limit the data ingestion rate.
The worker, which converts pending samples to parts, shouldn't be penalized by assisted merges,
since this may result in increased number of pending rows as seen at https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3647#issuecomment-1385039142
when the assisted merge takes too much time.
- Document the fix at docs/CHANGELOG.md
- Limit the concurrency for sendStaleMarkers() function in order to limit its memory usage
when big number of targets disappear and staleness markers are sent
for all the metrics exposed by these targets.
- Make sure that the writeRequestCtx is returned to the pool
when there is no need to send staleness markers.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3668
We have limited amount of time used by Github CI runners
and JS analysis accounts for a half of it.
Since JS represents only a small fraction of the codebase
and is solely maintained by one person - I suggest to disable
the CodeQL check in order to save CI runners time.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* feat: make the step input field global
* fix: correct get step from url
* fix: set minimumSignificantDigits to 1
* app/vmselect/vmui: `make vmui-update`
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
- Add a comment describing the purpose of the `role` field inside `apiConfig` struct
- Revert changes at lib/promscrape/discovery/dockerswarm/dockerswarm.go ,
since they reduce code readability. E.g. the reader needs to look up the named string constants
in order to get their values.
vmselect passes query timeout to vmstorage in seconds.
The commit 20e9598254 treated it as timeout in nanoseconds.
Fix this in order to prevent from the following errors under vmstorage load:
cannot process vmselect request: cannot execute "search_v7": couldn't start executing the request in 0.000 seconds,
since -search.maxConcurrentRequests=... concurrent requests are already executed.
The per-series timestamps are usually shared among series, so it is unsafe modifying them.
The issue has been appeared after the optimization at 2f3ddd4884
* {lib/server, app/}: use `httpAuth.*` flag as fallback for `*AuthKey` if it is not set
* lib/ingestserver/opentsdbhttp: fix opentdb HTTP handler not respecting `httpAuth.*` flags
* Apply suggestions from code review
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* vmagent: add minimal scrape file exampe for vmagent quick start
Signed-off-by: Artem Navoiev <tenmozes@gmail.com>
* replace example with link to your prometheus.yml in docker
Signed-off-by: Artem Navoiev <tenmozes@gmail.com>
Signed-off-by: Artem Navoiev <tenmozes@gmail.com>
- Use promutils.Labels.GetLabels() instead of comparing promutils.Labels.Labels to nil.
This make the code more consistent with other places.
- Mention the release where the issue has been introduced at docs/CHANGELOG.md.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3624
Previously the selected time series were split evenly among available CPU cores
for further processing - e.g unpacking the data and applying the given rollup
function to the unpacked data.
Some time series could be processed slower than others.
This could result in uneven work distribution among available CPU cores,
e.g. some CPU cores could complete their work sooner than others.
This could slow down query execution.
The new algorithm allows stealing time series to process from other CPU cores
when all the local work is done. This should reduce the maximum time
needed for query execution (aka tail latency).
The new algorithm should also scale better on systems with many CPU cores,
since every CPU processes locally assigned time series without inter-CPU communications.
The inter-CPU communications are used only when all the local work is finished
and the pending work from other CPUs needs to be stealed.
Unpack time series with less than 400K samples in the currently running goroutine.
Previously a new goroutine was being started for unpacking the samples.
This was requiring additional memory allocations.
Usually the number of blocks returned per each time series during queries is around 4.
So it is a good idea to pre-allocate 4 block references per time series
in order to reduce the number of memory allocations.
See the list of configs supported by Prometheus at f88a0a7d83/discovery/nomad/nomad.go (L76-L84)
- Removed "token" option. In can be set either via NOMAD_TOKEN env var or via `bearer_token` config option.
- Removed "scheme" option. It is automatically detected depending on whether the `tls_config` is set.
- Removed "services" and "tags" options, since they aren't supported by Prometheus.
- Added "region" option. If it is missing, then the region is read from NOMAD_REGION env var.
If this var is empty, then it is set to "global" in the same way as Nomad client does.
See 865ee8d37c/api/api.go (L297)
and 865ee8d37c/api/api.go (L555-L556)
- If the "server" option is missing, then it is read from NOMAD_ADDR in the same way
as Nomad client does - see 865ee8d37c/api/api.go (L294-L296)
This is a follow-up for 8aee209c53
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3367
This should reduce the maximum memory usage for processScrapedData() function by 2x.
The only part, which can be IO-bound in the processScrapedData() is pushData() call,
when it buffers data to persistent queue if the remote storage cannot keep up
with the data ingestion speed. In this case it is OK if the scrape pace will be limited.
This should reduce memory usage when scraping big number of targets,
since this limits the summary memory usage during concurrent parsing and relabeling
by the number of available CPU cores.
Previously the -maxConcurrentInserts was limiting the number of established client connections,
which write data to VictoriaMetrics. Some of these connections could be idle.
Such connections do not consume big amounts of CPU and RAM, so there is a little sense in limiting
the number of such connections. So now the -maxConcurrentInserts command-line option
limits the number of concurrently executed insert requests, not including idle connections.
It is recommended removing -maxConcurrentInserts command-line option, since the default value
for this option should work good for most cases.
This should prevent from out of memory errors when big number of vmselect
nodes send many concurrent requests to vmstorage
The limit can be controlled at vmstorage via the following command-line flags:
- search.maxConcurrentRequests
- search.maxQueueDuration
See https://docs.victoriametrics.com/Cluster-VictoriaMetrics.html#resource-usage-limits
- Document the bugfix at docs/CHANGELOG.md
- Wait until all the worker goroutines are done in consulWatcher.mustStop()
- Do not log `context canceled` errors when discovering consul serviceNames
- Removed explicit handling of gzipped responses at lib/promscrape/discoveryutils.Client,
since this handling is automatically performed by net/http.Transport.
See DisableCompression option at https://pkg.go.dev/net/http#Transport .
- Remove explicit handling of the proxyURL, since it is automatically handled
by net/http.Transport. See Proxy option at https://pkg.go.dev/net/http#Transport .
- Expliticly set MaxIdleConnsPerHost, since its default value equals to 2.
Such a small value may result in excess tcp connection churn
when more than 2 concurrent requests are processed by lib/promscrape/discoveryutils.Client.
- Do not set explicitly the `Host` request header, since it is automatically set by net/http.Client.
- Backport the bugfix to the recently added nomad_sd_configs - see https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3367
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3468
- Remove undocumented `username` and `password` config options from `nomad_sd_config`.
TODO: probably, remove these options from `consul_sd_config` too?
These options exist there for backwards compatibility purposes.
- Add __meta_nomad_service_alloc_id and __meta_nomad_service_job_id meta-labels
These labels contain AllocID and JobID fields for the discovered Nomad services.
- Various typo fixes.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3367
Got "Failed to upgrade legacy queries Datasource $ds was not found" in
Grafana on operator dashboard.
It's datasource variable was incorrectly named `datasource`.
Also made the rest of the dashboards have homogeneous datasource-variable
names and selections, matching vmagent dashboard.
- Show in the line tooltip the number of the query which generates the given line.
This simplifies comparison of lines generated by multiple queries.
- Show metric name as __name__ label in the line tooltip in the same way as other labels are shown there.
This makes the label information in the tooltip more consistent.
- Properly quote label values with JSON.stringify(). This prevents from improper formatting
when label values contain doublequote chars.
- Remove double curly braces artifact at graph legend for lines without names and labels.
- Properly use modifier for regular expressions across the code.
There is no need to manually call `queryDuration.UpdateDuration(startTime)`, because `defer queryDuration.UpdateDuration(startTime)` is executed at the beginning of the function(L660).
Allow configuring the default number of stored rule's update states in memory
via global `-rule.updateEntriesLimit` command-line flag or per-rule via rule's
`update_entries_limit` configuration param.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
This simplifies manual usage of the APIs. For example, the following query
would return the results over the 2022 year.
/api/v1/query_range?start=2022&end=2023&step=1d&query=...
This is equivalent to:
/api/v1/query_range?start=2022-01-01T00:00:00Z&end=2023-01-01T00:00:00Z&step=1d&query=...
Use "unexpected status code returned from %q: %d; expecting %d" log message format
instead of less clear format "unexpected status code returned from %q; expecting %d; got %d"
This is a follow-up for c612bb165e
- Rename `Custom panel` tab to more clear `Query` tab
- Rename `Cardinality` tab to `Explore cardinality`, so it becomes consistent with `Explore metrics` tab
- Move `Dashboards` tab to the end, since it isn't used too much
- Document the feature at docs/CHANGELOG.md.
- Document the metrics explorer at https://docs.victoriametrics.com/#metrics-explorer .
- Properly set `start` and `end` args for the selected time range
when performing the request, which returns metric names.
- Improve queries, so they return lower number of lines and labels.
This should improve metrics' exploration.
- Properly encode label filters and query args before passing them to VictoriaMetrics.
- Various cosmetic fixes.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3386
* app/vmbackupmanager: add metrics for better observability, include more information to `/api/v1/backups` API call response
* app/vmbackupmanager: drop old metrics before creating new ones
* app/vmbackupmanager: use `_total` postfix for counter metrics
* app/vmbackupmanager: remove `_total` postfix for gauge-like metrics
* app/vmbackupmanager: add `_last_run_failed` metrics for backups and retention
* app/vmbackupmanager: address review feedback
* app/vmbackupmanager: fix metric name
* app/vmbackupmanager: address review feedback, remove background updates of metrics, add restoring state of `_last_run_failed` metric from remote storage
* app/vmbackupmanager: improve performance for backup size calculation
* app/vmbackupmanager: refactor backup and retention runs to deduplicate each run logic
* {app/vmbackupmanager,lib/formatutil}: move HumanizeBytes into lib package
* app/vmbackupmanager: fix creating new metrics instead of reusing existing ones
* lit/formatutil: add comment to make linter happy
* app/vmbackupmanager: address review feedback
Previously too short lookbehind window d for rate(m[d]) could be automatically extended
if it didn't cover at least two raw samples. This was needed in order to guarantee
non-empty results from rate(m[d]) on short time ranges.
Now the lookbehind window isn't extended if it is set explicitly,
since it is expected that the user knows what he is doing.
The lookbehind window continues to be extended when needed if it isn't set explicitly.
For example, in the case of rate(m).
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3483
The issue triggers after the indexdb rotation for time series, which stop receiving new samples.
This results in missing data for such time series in query responses.
This commit should address the https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3502
The issue has been introduced in 2dd93449d8
- Document the change at docs/CHANELOG.md
- Log fatal errors if the -loggerJSONFields contains unexpected values
- Rename -loggerJsonFields to -loggerJSONFields for the sake of consistency naming commonly used in Go
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2348
Previously, $job_select, $job_storage and $job_insert
didn't respect the $job filter. This change updates
the variable queries to account for set $job variable.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
This fixes handling of values bigger than 2GiB for the following command-line flags:
- -storage.minFreeDiskSpaceBytes
- -remoteWrite.maxDiskUsagePerURL
Blocked small merges may result into big number of small parts, which, in turn,
may result in increased CPU and memory usage during queries, since queries need to inspect
all the existing small parts.
The issue has been introduced in 8189770c50
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3337
Previously only up to 100K results were cached.
This could result in sub-optimal performance when more than 100K unique strings were actually used.
For example, when the relabeling rule was applied to a million of unique Graphite metric names
like in the https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3466
This commit should reduce the long-term CPU usage for https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3466
after all the unique Graphite metrics are registered in the FastStringMatcher.Transform() cache.
It is expected that the number of unique strings, which are passed to FastStringMatcher.Match(),
FastStringTransformer.Transform() and to InternString() during the last 5 minutes,
is limited, so the function results fit memory. Otherwise OOM crash can occur.
This should be the case for typical production workloads.
The new annotation is hidden by default and suppose to show
component `short_version` label change on the panels.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
The system links are absolute, e.g. they start from `/`, so there are high chances
they won't work as expected when requested via proxy such as vmselect with -vmalert.proxyURL
command-line flag.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/3424
http.Request was used as a part of state struct
for generating the curl command when viewing the rule's
state changes.
It appears, that holding a referencing is far more expensive
than generating the curl command immediately.
On the test with 40k rules, this change reduces memory
and CPU usage by 50%.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* lib/promscrape/discovery/azure: remove API server from URL returned by azure
* lib/promscrape/discovery/azure: validate nextLink contains same URL as apiServer
This should simplify further debugging, since the first thing to start the debugging by query trace
is to know the version of VictoriaMetrics, which produced this trace.
Alert `RequestErrorsToAPI` could be permanently triggered due to
mistakes in clients configuration. However, such requests are unlikely
to cause VM health state change. So there is no need in displaying
this alert because there will be no correlation caused by it.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Signed-off-by: hagen1778 <roman@victoriametrics.com>
- Document the change at docs/CHANGELOG.md
- Run `make docs-sync` for copying app/vmgateway/README.md to docs/vmgateway.md
in order to propagate docs' changes to https://docs.victoriametrics.com/vmgateway.html
* vmalert: correctly return error for RW failures
By mistake, in 0989649ad0 the error
for remote write failures weren't return to user.
This change fixes it.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
The main purpose of this command-line flag is to increase the lifetime of low-end flash storage
with the limited number of write operations it can perform. Such flash storage is usually
installed on Raspberry PI or similar appliances.
For example, `-inmemoryDataFlushInterval=1h` reduces the frequency of disk write operations
to up to once per hour if the ingested one-hour worth of data fits the limit for in-memory data.
The in-memory data is searchable in the same way as the data stored on disk.
VictoriaMetrics automatically flushes the in-memory data to disk on graceful shutdown via SIGINT signal.
The in-memory data is lost on unclean shutdown (hardware power loss, OOM crash, SIGKILL).
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3337
The new panels have been added to the vmstorage and drilldown rows.
`Disk space usage %` is supposed to show disk space usage percentage.
This panel is now also referred by `DiskRunsOutOfSpace` alerting rule.
This panel has Drilldown option to show absolute values.
`Disk space usage % by type` shows the relation between datapoints
and indexdb size. It supposed to help identify cases when indexdb
starts to take too much disk space.
This panel has Drilldown option to show absolute values.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Method `metrics()` now pre-allocates slices for labels
and results from query responses. This reduces the number
of allocations on the hot path for instant requests.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
The recent change in modifying default value
of `datasource.queryStep` flag resulted in situation
where replay mode was always running queries with
step=`datasource.queryStep`. When it should always
use rule's evaluation interval.
The fix is related not to replay mode only, but
for all Range requests. Now step param is set
individually for each mode.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Fixes a missing `&` char in data link for ETA panel
on cluster dashboards. Without `&` char it generates
wrong link when click on Drilldown menu.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* app/vmalert: add `remoteWrite.sendTimeout` command-line flag to configure timeout for sending data to `remoteWrite.url`
* vmalert: remove WriteTimeout from clients Cfg
No need to have it as a part of configuration struct:
* the client isn't used by other packages;
* there are no internal tests to check the WriteTimeout.
* vmalert: remove DisablePathAppend from clients Cfg
No need to have it as a part of configuration struct:
* the client isn't used by other packages;
* there are no internal tests to check the DisablePathAppend.
Co-authored-by: hagen1778 <roman@victoriametrics.com>
- Return meta-labels for the discovered targets via promutils.Labels
instead of map[string]string. This improves the speed of generating
meta-labels for discovered targets by up to 5x.
- Remove memory allocations in hot paths during ScrapeWork generation.
The ScrapeWork contains scrape settings for a single discovered target.
This improves the service discovery speed by up to 2x.
The change list is the following:
* bump Grafana version to 9.2.6;
* replace old "Graph" panel with "TimeSeries" panel;
* show % usage of Mem and CPU additionally to of absolute values;
* `Caches` row was removed. All needed info for caches is now part of `Troubleshooting`;
* add Annotations for Alert triggers. Not all alerts are supposed to be displayed
on the dashboard, but only those with label `show_at: dashboard`.
See `alerts.yml` change.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Signed-off-by: hagen1778 <roman@victoriametrics.com>
The change list is the following:
* bump Grafana version to 9.2.6;
* replace old Graph panel with TimeSeries panel;
* add RemoteWrite section;
* allow configuring topK elements for some of the panels;
* Preer grouping by job instead of grouping by instance.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* flag reference update
there is no flag `-datasource.disablePathAppend` and datasource actually checking for `-remoteRead.disablePathAppend`
* update source for doc as well
The change list is the following:
* bump Grafana version to 9.2.6;
* add version change annotations;
* switch to per-job panels instead of per-instance;
* add drilldown option for resource usage panels.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Signed-off-by: hagen1778 <roman@victoriametrics.com>
The change list is the following:
* bump Grafana version to 9.2.6;
* remove artifacts in data links.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* {app/vmstorage,app/vmselect}: add API to get list of existing tenants
* {app/vmstorage,app/vmselect}: add API to get list of existing tenants
* app/vmselect: fix error message
* {app/vmstorage,app/vmselect}: fix error messages
* app/vmselect: change log level for error handling
* wip
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* some unexpected DS UIDs were removed;
* replace `$instance.*` filter with `$instance` since we respect
the instance port anyway;
* remove predefined datasource for `clusterbytenant`
in favour of datasource variable `ds`.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Signed-off-by: hagen1778 <roman@victoriametrics.com>
The purpose of the update is to make the dash more usable
for large installations with many instances. Panels which showed
metrics per-instance (Mem, CPU) now are showing metrics per-job or min/max/avg
aggregations in % instead. This supposed to help immediately to identify
resource shortage and remain usable for small and big installations.
For cases when detailed info is needed, to the bottom of the dashboard
a new row `Drilldown` was added. Panels like Mem or CPU now contain
a `data-link` named `Drilldown` (cis shown on line click) which takes
user to more detailed panel.
The change list is the following:
* bump Grafana version to 9.1.0;
* replace old "Graph" panel with "TimeSeries" panel;
* improve Uptime panel to show number of instances per job;
* show % usage of Mem and CPU instead of absolute values;
* `Caches` row was removed. All needed info for caches is now part of `Troubleshooting`;
* add `Drilldown` section for detailed resource usage;
* add Annotations for Alert triggers. Not all alerts are supposed to be displayed
on the dashboard, but only those with label `show_at: dashboard`.
See `alerts-cluster.yml` change.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* fix: reset the value of the switches trace and cache
* fix: add cursor text for inputs
* fix: solve the Infinite loop of useFetchQuery.ts
* fix: change condition for show/hide autocomplete
* fix: add limit error length for input
The default list of alerting rules contains the basic
rules for checking vmalert's health state and is recommended
to use for monitoring vmalert deployments.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
It has been appeared that some production workloads could suffer for some time
after every reset of the previous cache when it gets less than 5% of requests
after the needed item isn't found in the current cache. This could result
in reduced cache hit rates, which, in turn, could increase CPU, disk IO and RAM
usage needed for reading, unpacking and caching the missed data from disk.
This commit reduces the cache miss threshold for resetting the previous cache from 5% to 1%.
This should reduce the possible negative impact after each cache reset by at least 5x,
while reducing the total memory used by caches.
This is a follow-up for d906d8573e
* docs/operator: change VMAgentRemoteWriteSettings.MaxDiskUsagePerURL type from int32 to int64
* docs: operator updates api description
Co-authored-by: f41gh7 <nik@victoriametrics.com>
- Remove trailing whitespace at the end of lines
- Remove redundant sentence stating that time series matching the given selector will be deleted.
It should be clear from the surrounding context.
* deployment/docker/docker-compose-cluster.yml: bump VictoriaMetrics Cluster components to the latest v1.83.0 version
* deployment/docker/docker-compose.yml: bump VictoriaMetrics Single node and vmutils to the latest v1.83.0 version
The issue was in the `labels := dst[offset:]` line in the beginning of appendExtraLabels() function.
The `dst` may be re-allocated when adding extra labels to it. In this case the addition of `exported_`
prefix to labels inside `labels` slice become invisible in the returned `dst` labels.
While at it, properly handle some corner cases:
- Add additional `exported_` prefix to clashing metric labels with already existing `exported_` prefix.
- Store scraped metric names in `exported___name__` label if scrape target contains `__name__` label.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3278
Thanks to @jplanckeel for the initial attempt to fix this issue
at https://github.com/VictoriaMetrics/VictoriaMetrics/pull/3281
* feat: apply serverURL on down Enter
* fix: change method of set time range
* fix: remove prevent run fetch without changes
* fix: prevent reset timerange when autorefresh
Previously the `quotesEscape` function was escaping only double quotes.
This wasn't enough, since the input string could contain other special chars,
which must be escaped when put inside JSON string. For example, carriage return and line feed chars (\n\r),
backslash char, etc. This led to the following issues, which were improperly fixed:
- https://github.com/VictoriaMetrics/VictoriaMetrics/issues/890 - this issue
was "fixed" by introducing the `crlfEscape` function, which led to unnecessary
complications in user templates, while not fixing various corner cases
such as backslash chars in the input string.
See 1de15ad490
- https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3139 - this issue
was "fixed" by urlencoding the whole string passed to -external.alert.source
command-line flag. This led to invalid urls, which couldn't be parsed by Grafana.
See 00c838353d
and 4bd0244599
This commit properly encodes the input string passed to `quotesEscape`, so it can be safely embedded inside JSON strings.
This commit deprecates crlfEscape template function and adds the following new template functions:
- strvalue and stripDomain - these functions are supported by Prometheus, so they were added
for compatibility purposes.
- jsonEscape and htmlEscape for converting the input string to valid quoted JSON string
and for html-escaping the input string, so it could be safely embedded as a plaintext
into html.
This commit also documents all supported template functions at https://docs.victoriametrics.com/vmalert.html#template-functions
The deprecated crlfEscape function isn't documented on purpose, since its usefulness is negative in general case.
This reverts commit 00c838353d.
Reason for revert: it incorrectly fixes the issue https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3139 .
Now `-external.alert.source=explore?orgId=1&left=...` is converted to the following invalid url, which cannot be handled by Grafana:
https://grafana.example.com/explore%3ForgId%3D1%26left%3D...
The next commit will contain the correct fix of the issue - the `quotesEscape` function must
properly escape the string, so it could be embedded into JSON string. This function must
properly escape \n\r chars too. In this case the `crlfEscape` function becomes unnecessary.
Actually, the next commit makes the `crlfEscape` function deprecated.
Previously netstorage.MustStop() call didn't free up all the resources,
so the subsequent call to nestorage.Init() would panic.
This allows writing tests, which call nestorage.Init() + nestorage.MustStop() in a loop.
Blocks outside the configured retention are eventually deleted during background merge.
But such blocks may reside in the storage for long time until background merge.
Previously VictoriaMetrics could spend additional CPU time on processing such blocks
during search queries. Now these blocks are skipped.
The searchTSIDs function was searching for metricIDs matching the the given tag filters
and then was locating the corresponding TSID entries for the found metricIDs.
The TSID entries aren't needed when searching for time series names (aka MetricName),
so this commit removes the uneeded TSID search from the implementation of /api/v1/series API.
This improves perfromance of /api/v1/series calls.
This commit also improves performance a bit for /api/v1/query and /api/v1/query_range calls,
since now these calls cache small metricIDs instead of big TSID entries
in the indexdb/tagFilters cache (now this cache is named indexdb/tagFiltersToMetricIDs)
without the need to compress the saved entries in order to save cache space.
This commit also removes concurrency limiter during searching for matching time series,
which was introduced in 8f16388428, since the concurrency
for all the read queries is already limited with -search.maxConcurrentRequests command-line flag.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/648
This increases the maximum time for cache population with new entries from 20 minutes to 40 minutes.
This
This change shouldn't increase memory usage for caches, since the prev cache cleaner
should free up memory by deleting unused prev cache as soon as possible.
See 08ca45d238 for details on prev cache cleaner.
The panic has been introduced in 68f3a02589
While at it, add padding to shard structs in order to avoid false sharing on mordern CPUs
This should improve scalability on systems with many CPU cores
This means that the majority of requests are successfully served from the current cache,
so the previous cache can be reset in order to free up memory.
The message about dropped data still remains at `error` level.
The change supposed to make log message more clear about how
serious it is.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* feat: add maximum display series by tabs
* feat: add warning on PredefinedPanels.tsx
* docs/CHANGELOG.md: vmui limit number of plotted series
* docs/CHANGELOG.md: vmui limit number of plotted series
* wip
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* lib/backup: set s3 default region to us-west-2
it should fix an error with region detection for bucket, if AWS_REGION env var is not set
* Update lib/backup/s3remote/s3.go
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* Multi retention standardization of docs
While reading through the docs the implementation details had different formatting for storageNode. This standardizes them and adds a link to the retention docs.
* Update docs/guides/guide-vmcluster-multiple-retention-setup.md
Co-authored-by: Aliaksandr Valialkin <valyala@gmail.com>
The default value of `-datasource.queryStep` has changed, so we update
the troubleshooting docs accordingly.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Sort labels explicitly after calling the ParsedConfigs.Apply() when needed.
This reduces CPU usage when performing metric-level relabeling, where labels' sorting isn't needed.
Previously the `__address__` label could contain only `host:port` part of the target url,
while the scheme and metrics path were obtained from `__scheme__` and `__metrics_path__`
labels. Now it is possible to set the full url in `__address__` label.
This makes valid the following scrape config, which is frequently used by novice users:
scrape_configs:
- job_name: foo
static_configs:
- targets:
- http://host1/metrics1
- https://host2/metrics2
The stored backups would help to identify docs corruption
but aren't needed for commiting. So `.tmp` backup files
are also git-ignored.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* Optimize fast path for /api/v1/import when importing numeric values
* Move the docs about the change from features to bugfixes at docs/CHANGELOG.md
* Update tests at lib/protoparser/vmimport
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3161
* app/vmselect: properly work when export import json from `api/v1/{export, import}` API
* app/vmselect: update convert function
* app/vmselect: export null if `math.IsNaN(v)`
* app/vmselect: get float from json
* lib/protoparser: add test
* docs: add change log
* lib/protoparser: make export import api compatible
* Document the addition of Azure blob storage support in vmbackup / vmrestore
* List the supported storage system types at docs/vmrestore.md
* Mention about azblob storage system support at -src and -dst command-line flags
for vmbackup / vmrestore tools.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1029
* Use vm_account_id and vm_project_id labels to be consistent with https://docs.victoriametrics.com/Cluster-VictoriaMetrics.html#multitenancy-via-labels
* Document the feature that vmalert now exposes vm_account_id and vm_project_id
labels if -clusterMode is set.
* Use literal strings instead of string constants for vm_account_id and vm_project_id.
This improves code readability.
The change is supposed to provide additional flexibility for generating alert's
source link based on label values.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
The `docs-sync` command was updated to modify images path
for assets in `docs/` folder. The change allows to refer images from `docs`
without copying them to the root folder.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Incorrect 301 redirects can be cached by user agents such as web browsers.
This can complicate recovery procedure after the incorrect redirect is fixed,
e.g. web browser cache must be reset.
The related issue - https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1752
* app/vminsert: allows parsing tenant id from labels
it should help mitigate issues with vmagent's multiTenant mode, which works incorrectly at heavy load
and it cannot handle more then 100 different tenants.
This functional hidden with flag and do not change vminsert default behaviour
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2970
* Update docs/Cluster-VictoriaMetrics.md
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
* wip
* app/vminsert/netstorage: clean remaining labels in order to free up GC
* docs/Cluster-VictoriaMetrics.md: typo fix
* wip
* wip
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
Cache `action: replace` results for non-trivial regexs and return them next time
instead of performing CPU-intensive regex replacement.
Optimize also `action: labelmap_all` and `action: replace_all` in the same way.
Previously empty series (e.g. series with all NaN samples) were passed to aggregate functions.
Such series must be ingored by all the aggregate functions.
So it is better from consistency PoV filtering out empty series before applying aggregate functions.
* app/vmselect: ignore empty series for `limit_offset`
VictoriaMetrics doesn't return empty series (with all NaN values) to
the user. But such series are filtered after transform functions.
It means `limit_offset` will account for empty series as well.
For example, let's consider following data set:
```
time series:
foo{label="1"} NaN, NaN, NaN, NaN // empty series
foo{label="2"} 1, 2, 3, 4
foo{label="3"} 4, 3, 2, 1
```
When user requests all series for metric `foo` the empty series
will be filtered out:
```
/query=foo:
foo{label="v2"} 1, 2, 3, 4
foo{label="v3"} 4, 3, 2, 1
```
But `limit_offset(1, 1, foo)` is applied to original series, not filtered yet.
So it will return `foo{label="v2"}` (skips the first in list)
```
/query=limit_offset(1, 1, foo):
foo{label="v2"} 1, 2, 3, 4
```
Expected result would be to apply `limit_offset` to already filtered list,
so in result we receive `foo{label="v3"}`:
```
/query=limit_offset(1, 1, foo):
foo{label="v3"} 4, 3, 2, 1
```
The change does exactly that - filters empty series before applying `limit_offset`.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* app/vmselect: ignore empty series for `limit_offset`
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Signed-off-by: hagen1778 <roman@victoriametrics.com>
According to Ruler specification, only labels returned within time series
should be available for use in annotations.
For long time, vmalert didn't respect this rule. And in PR
https://github.com/VictoriaMetrics/VictoriaMetrics/pull/2403
this was fixed for the sake of compatibility. However, this resulted
into users confusion, as they expected all configured and extra labels
to be available - https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3013
This fix allows to use extra labels in Annotations. But in the case of conflicts
the original labels (extracted from time series) are preferred.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* lib/{httpserver,netutil}: allow to define min and max TLS version of the http server
* lib/httpserver: added descriptions about tls supported versions
* lib/netutil: check minimal tls version, added supported tls versions to error
* wip
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
The workaround was introduced to fix https://github.com/VictoriaMetrics/VictoriaMetrics/issues/962.
However, it didn't prove itself useful. Instead, it is recommended using `increase_pure` function.
Removing the workaround makes VM to produce accurate results when calculating
`delta` or `increase` functions over slow-changing counters with vary intervals
between data points.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Signed-off-by: hagen1778 <roman@victoriametrics.com>
- Clarify the description for -datasource.queryStep command-line flag
- Consistently use a single dash in front of -datasource.queryStep command-line flag
- Update -help output at docs/vmalert.md
- Consistently use single dash in front of command-line flags instead of double dashes.
- Add a warning that too small -search.latencyOffset may lead to incomplete query results.
Change default value for command-line flag `datasource.queryStep` from `0s` to `5m`.
Param `step` is added by vmalert to every rule evaluation request sent to datasource.
Before this change, `step` was equal to group's evaluation interval by default.
Param `step` for instant queries defines how far VM can look back for the last written data point.
The change supposed to improve reliability of the rules evaluation when evaluation interval
is lower than scraping interval.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Cluster components always have `-cluster` suffix. The change fixes
incorrect image tag in docker-compose manifest.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* deployment/docker/docker-compose.yml: adds version tags for VictoriaMetrics containers
* deployment/docker/docker-compose-cluster.yml: adds version tags for VictoriaMetrics containers
* deployment/docker: move cluster compose env to master branch
The change supposed to simplify the process of maintaining for
single/cluster docker-compose envs, alerts, dashboards. It also
supposes to reduce confusion for users when looking for cluster
related alerts/configs.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* deployment/docker: move cluster compose env to master branch
Review updates.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Now vmalert will print the following messages on dupliсates:
```
"recording rule \"record\"; expr: \"up == 1\"; labels: summary={{ value|query }}" is a duplicate within the group "test"
"alerting rule \"alert\"; expr: \"up == 1\"; labels: description={{ value|query }}" is a duplicate within the group "test"
```
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3127
Signed-off-by: hagen1778 <roman@victoriametrics.com>
The standard Snappy encoder from github.com/golang/snappy shows quite good performance number
for compressing the Prometheus remote_write proto messages according to the added benchmarks,
so there is no need in switching to github.com/klauspost/compress/s2 yet.
* dashboards/cluster: few updates
* apply consistent formatting across panels;
* make resource usage panels per component more detailed;
* add extra panels to vmselect for displaying
`vm_rows_read_per_query`, `vm_rows_scanned_per_query`,
`vm_rows_read_per_series` and `vm_series_read_per_query` metrics.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* dashboards/single: few updates
* apply consistent formatting across panels;
* add extra panels to Performance for displaying
`vm_rows_read_per_query`, `vm_rows_scanned_per_query`,
`vm_rows_read_per_series` and `vm_series_read_per_query` metrics.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* dashboards/vmagent: few updates
* apply consistent formatting across panels;
* add panels for showing number of samples ingested
or scraped;
* adapt resource usage panels for multiple selected jobs/instances;
* add adhoc variable;
* display vmagent's version in Stats.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* dashboards/vmalert: few updates
* apply consistent formatting across panels;
* adapt resource usage panels for multiple selected jobs/instances;
* show vmalert version in Stats section.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmalert: always re-evaluate Annotations
Previously, Annotations were evaluated only:
1. On alert creating.
2. On alert's value change.
This is premature optimization. It was assumed that since annotations
could contain only text with alert's labels or value - there is no need
in spending resources to re-compile Annotations.
Later, template function `query` was added, which can execute
arbitrary queries and return different results on every evaluation.
So if it was used in annotations, it would be executed only on init
or value change.
Another case when optimization caused an issue - annotations hot reload.
In this case, annotations of the active alert won't change even if Rule's
annotations were changed.
This fix enables Annotations re-evaluation on each iteration to resolve
issues above. It would have some impact on performance, but it is unlikely
it will be noticeable.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmalert: add tp Changelog
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Signed-off-by: hagen1778 <roman@victoriametrics.com>
The change adds an example of `curl` command to the Rule's page.
The command is generated for each recorded state. It is supposed
user can just copy&execute the command to see what was returned
to vmalert.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmselect/promql: add alphanumeric sort by label (sort_by_label_numeric)
* vmselect/promql: fix tests, add documentation
* vmselect/promql: update test
* vmselect/promql: update for alphanumeric sorting, fix tests
* vmselect/promql: remove comments
* vmselect/promql: cleanup
* vmselect/promql: avoid memory allocations, update functions descriptions
* vmselect/promql: make linter happy (remove ineffectual assigment)
* vmselect/promql: add test case, fix behavior when strings are equal
* vendor: update github.com/VictoriaMetrics/metricsql from v0.44.1 to v0.45.0
this adds support for sort_by_label_numeric and sort_by_label_numeric_desc functions
* wip
* lib/promscrape: read response body into memory in stream parsing mode before parsing it
This reduces scrape duration for targets returning big responses.
The response body was already read into memory in stream parsing mode before this change,
so this commit shouldn't increase memory usage.
* wip
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
vmalert: add experimental feature of storing Rule's evaluation state
The new feature keeps last 20 state changes of each Rule
in memory. The state are available for view on the Rule's
view page. The page can be opened by clicking on `Details`
link next to Rule's name on the `/groups` page.
States change suppose to help in investigating cases when Rule
doesn't generate alerts or records.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
This reduces scrape duration for targets returning big responses.
The response body was already read into memory in stream parsing mode before this change,
so this commit shouldn't increase memory usage.
- Rename logDebug() to logDebugf() and pass format string together
with format args directly to logDebugf(). This eliminates fmt.Sprintf()
overhead at logDebug() call site when debugging is disabled.
- Format labels in debug message in Prometheus format, e.g. {label1="value1",...labelN="valueN"}
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3025
`go install` is the preferred way for installing go binaries starting
from the minimum supported Go version for VictoriaMetrics - Go1.18 -
see https://tip.golang.org/doc/go1.18#go-command
The automated push of release tags to Github require specifying the remote repository name
when doing `git push <remote-repo-name> v1.xx.y`.
The remote repository name can differ in different environments,
so it cannot be put into Makefile rule.
TODO: create a Makefile rule, which generates standard remote names for public
and private repositories in Git, so `git push` for release tags could be automated then.
- Add getCommonParamsWithDefaultDuration function and use it at /api/v1/series, /api/v1/labels and /api/v1/label/.../values
- Document the default behaviour for setting 5 minutes time range if start arg isn't passed to /api/v1/series, /api/v1/labels and /api/v1/label/.../values
- Document the change at docs/CHANGELOG.md
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/pull/3052
The panic may trigger during data blocks' processing received
from vmstorage nodes when some of vmstorage nodes return an error
or when `-replicationFactor` is set to values higher than 2 at `vmselect`.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3058
Note that the parallel execution of `union()` args may take more memory and CPU time
than the sequential execution if args contain heavy queries, which may load all the available CPU,
disk and memory resources and vmselect and vmstorage levels.
- Use getScalar() function for obtaining the expected scalar from phi arg
- Reduce the error message returned to the user when incorrect phi is passed to histogram_quantiles
- Improve the description of this bugfix in the docs/CHANGELOG.md
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/3026
Cache sanitized label names and return them next time.
This reduces the number of allocations and speeds up the SanitizeLabelName()
function for common case when the number of unique label names is smaller than 100k
The following regex patterns are optimized:
- literal string match, e.g. "foo"
- prefix match, e.g. "foo.*" and "foo.+"
- substring match, e.g. ".*foo.*" and ".+foo.+"
- alternate values match, e.g. "foo|bar|baz"
This is OK, since the anchors are implicitly applied to the whole regexp.
This optimization should improve the speed for regexp series filters with explicit $ and ^ anchors.
For example, `{label="^(foo|bar)$"}`
- Document the change at docs/CHANGELOG.md
- Move auth token parsing from app/vmagent/opentsdbhttp/ to app/vmagent/main.go,
since it must be parsed only when multitenancy support is enabled at vmagent side.
See https://docs.victoriametrics.com/vmagent.html#multitenancy
These metrics allow alerting when the number of unique series approach the limit.
For example, the following query alerts when the number of series reaches 90% of the configured limit:
vm_hourly_series_limit_current_series / vm_hourly_series_limit_max_series > 0.9
The io/ioutil package is deprecated since Go1.16 - see https://tip.golang.org/doc/go1.16#ioutil
VictoriaMetrics requires at least Go1.18, so it is time to remove the io/ioutil from source code
This is a follow-up for 02ca2342ab
ioutil.ReadAll is deprecated since Go1.16 - see https://tip.golang.org/doc/go1.16#ioutil
VictoriaMetrics requires at least Go1.18, so it is OK to switch from ioutil.ReadAll to io.ReadAll.
This is a follow-up for 02ca2342ab
The ioutil.ReadDir is deprecated since Go1.16 - see https://tip.golang.org/doc/go1.16#ioutil
VictoriaMetrics requires at least Go1.18, so it is time to switch from io.ReadDir to os.ReadDir
This is a follow-up for 02ca2342ab
The ioutil.{Read|Write}File is deprecated since Go1.16 -
see https://tip.golang.org/doc/go1.16#ioutil
VictoriaMetrics needs at least Go1.18, so it is safe to remove ioutil usage
from source code.
This is a follow-up for 02ca2342ab
* lib/storage: bump max merge concurrency for small parts to 15
The change is based on the feedback from users on github.
Thier examples show, that limit of 8 sometimes become a
bottleneck. Users report that without limit concurrency
can climb up to 15-20 merges at once.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* Update lib/storage/partition.go
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
Flags `-search.maxTagKeys` and `-search.maxTagValues` are present at vmstorage and not at vmselect
```
# ./vmselect-prod -h | egrep 'search.maxTagKeys|search.maxTagValues'
# ./vmstorage-prod -h | egrep 'search.maxTagKeys|search.maxTagValues'
-search.maxTagKeys int
-search.maxTagValues int
```
We switch default alert's source link to redirect user
to vmalert's UI instead of previous JSON object. While it breaks
compatibility, it also supposed to improve user's experience.
The old behavior can be achieved by updating `-external.alert.source`
command-line flag.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Signed-off-by: hagen1778 <roman@victoriametrics.com>
The 429 status code means that the server is overwhelmed with requests.
The client can retry the request after some wait time.
Implement this strategy for service discovery and scrape requests.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2940
* Explicitly store a pointer to UserReadableError in the error interface.
Previously Go automatically converted the value to a pointer before storing in the error interface.
* Add Unwrap() method to UserReadableError, so it can be used transparently with the other code,
which calls errors.Is() and errors.As().
* Document the change in docs/CHANGELOG.md
When read query fails, VM returns rich error message with
all the details. While these details might be useful
for debugging specific cases, they're usually too verbose
for users.
Introducing a new error type `UserReadableError` is supposed
to allow to return to user only the most important parts
of the error trace. This supposed to improve error readability
in web interfaces such as VMUI or Grafana.
The full error trace is still logged with the full context
and can be found in vmselect logs.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Previously a single syncwg.WaitGroup was used for tracking the lifetime of processBlock callbacks
across all the per-vmstorage goroutines. This could be slow on systems with many CPU cores
because of inter-CPU synchronization overhead.
Use a separate per-vmstorage sync.WaitGroup instead in order to reduce inter-CPU synchronization overhead.
This should imrpove performance for heavy queries over big number of blocks on multi-CPU systems.
Other components, such as `vmagent`, mark these flags as sensitive and
hide them from the `/metrics` endpoint by default. This commit adds
similar handling to the `vmalert` component, hiding them by default, to
prevent logging of secrets inappropriately.
Showing of these values is controlled by an additional flag.
Follow up to https://github.com/VictoriaMetrics/VictoriaMetrics/pull/2947
* lib/storage: prevent excessive loops when storage is in RO
Returning nil error when storage is in RO mode results
into excessive loops and function calls which could
result into CPU exhaustion. Returning an err instead
will trigger delays in the for loop and save some resources.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* document the change
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
vmalert can be successfully used with datasources
compatible with Prometheus HTTP API. So we remove comments or
notes in Readme which are saying opposite.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* lib/auth: add tests from NewToken function
* lib/auth: update test, fix problem with type conversion
* lib/auth: update test description
* lib/auth: simplify failure tests
The `-mod=vendor` is automatically set when there is a `vendor` directory
starting from Go1.14 - see https://go.dev/doc/go1.14#go-command
Since the minimum supported Go version for VictoriaMetrics is Go1.17,
then the `-mod=vendor` option is no longer needed.
* Update README.md: fix typo to "HA pairs"
Update README.md to fix typo from "HA paris" to "HA pairs"
* Update docs/README.md : Fix of typo from "HA paris"
Update docs/README.md : Fix of typo from "HA paris" to "HA pairs"
* Update of Single-server-VictoriaMetrics.md to fix typo "HA paris" to "HA pairs"
Update of Single-server-VictoriaMetrics.md to fix typo "HA paris" to "HA pairs"
This adds the following push-related metrics when -pushmetrics.url is set:
- metrics_push_interval_seconds
- metrics_push_total
- metrics_push_errors_total
- metrics_push_bytes_pushed_total
- metrics_push_duration_seconds
- metrics_push_block_size_bytes
Updates https://github.com/VictoriaMetrics/metrics/issues/35
* deployment/docker/Makefile: added docker-scan
docker-scan based on native 'docker scan' function that use snyk.io, see https://docs.docker.com/engine/scan/
* set to call 'docker-scan after release binaries but before publishing
Reduce inter-CPU communications when processing the query over big number of time series.
This should improve performance for queries over big number of time series
on systems with many CPU cores.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2896
Based on b596ac3745
Thanks to @zqyzyq for the idea.
Usernames could be duplicate if it has uniq password.
vmauth makes routing based on auth token and username + password combination must be unique for this case.
The new metric `vmagent_remotewrite_queues` exports a static value of
number of configured remote write queus. This metric is useful to
calculate total saturation per each configured URL with given number
of queues. See corresponding changes to vmagent alerts and dashboard.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
The change allows to specify default value for `getScrapeInterval`
function when actual interval can't be calculated.
Before the change, function were returning `maxSilenceInterval` (5m)
in such cases, which may be not correct for instant queries processing.
The specific scenario where using `maxSilenceInterval` caused issues
is the following:
1. Series becomes stale;
2. Client (in this case vmalert) continues to request series every 15s;
3. Database returns empty results as expected;
4. But at some specific moment of time database returns datapoints from `now()-5m`,
because lookback window was extended to `maxSilenceInterval`.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmselect: cover special cases for vmalert's routing in single-node version
* remove trailing `/` from requests
* redirect to vmalert's home page when `/vmalert` is requested.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmalert: fix review comments
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* Update app/vmselect/main.go
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
- Use binary search instead of linear scan when locating the run of smallest timestamps
in blocks with intersected time ranges. This should improve performance
when merging blocks with big number of samples
- Skip samples with duplicate timestamps. This should increase query performance
in cluster version of VictoriaMetrics with the enabled replication.
* vmalert: deprecate alert's status link
Deprecate alert's status link `/api/v1/<groupID>/<alertID>/status` in favour of
`api/v1/alerts?group_id=<group_id>&alert_id=<alert_id>"`.
The change was needed for simplifying logic in vmselect for proxying vmalert's requests.
The old alert's status link will be still supported for a few versions but will be removed in the future.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2825
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmalert: fix review comments
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Previously the time series could be put into dateMetricIDCache without
registering in the per-day inverted index if GetOrCreateTSIDByName
finds TSID entry in the global index. This could lead to missing
series in query results.
The issue has been introduced in the commit 55e7afae3a,
which has been included in VictoriaMetrics v1.78.0
* docs: warn about potential issue with read queries for 1.78.0
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* docs: warn about potential issue with read queries for 1.78.0
Signed-off-by: hagen1778 <roman@victoriametrics.com>
- show dates in human-readable format, e.g. 2022-05-07, instead of a numeric value
- limit the maximum length of queries and filters shown in trace messages
Production experience shows that 100k is too big for /api/v1/series .
It leads to increased CPU usage when Grafana queries /api/v1/series over VictoriaMetrics
with big number of time series during auto-completion and when modifying template variables.
Previously SearchMetricNames was returning unmarshaled metric names.
This wasn't great for vmstorage, which should spend additional CPU time
for marshaling the metric names before sending them to vmselect.
While at it, remove possible duplicate metric names, which could occur when
multiple samples for new time series are ingested via concurrent requests.
Also sort the metric names before returning them to the client.
This simplifies debugging of the returned metric names across repeated requests to /api/v1/series
querytracer has been added to the following storage.Storage methods:
- RegisterMetricNames
- DeleteMetrics
- SearchTagValueSuffixes
- SearchGraphitePaths
Previously the cache could store 10K unique regexps. When every regexp is huge (e.g. hundreds of kilobytes),
then the total cache size could grow to multiples of gigabytes. Now the cache size is limited by the total length
of all cached regexps. So huge regexps won't result in high memory usage for the cache.
alerts: filter out non error log messages for `TooManyLogs`
Info and Warn error levels aren't always a result of malfunctioning
or faulty state. So we filter them out.
If the previous dial attempt was unsuccessful, then all the new dial attempts are skipped
until the background goroutine determines that the given address can be successfully dialed.
This reduces query latency when some of vmstorage nodes are unavailable and dialing them is slow.
This should help with https://github.com/VictoriaMetrics/VictoriaMetrics/issues/711
This commit is based on ideas from the https://github.com/VictoriaMetrics/VictoriaMetrics/pull/2756
The main differences are:
- The check for healthy/unhealthy storage nodes is moved one level lower from app/vmselect/netstorage to lib/netutil.ConnPool.
This makes possible re-using this feature everywhere lib/netutil.ConnPool is used.
- The check doesn't take into account handshake errors for already established connections.
Handshake errors usually mean improperly configured VictoriaMetrics cluster, so they shouldn't be ignored.
The commit 5fb45173ae takes into account only newly registered series
when applying cardinality limits. This means that the cardinality limit could be exceeded with already registered series.
This commit returns back accounting for already registered series when applying cardinality limits.
Previously the creation of per-day indexes and global indexes
for the newly registered time series was decoupled.
Now global indexes and per-day indexes for the current day are created toghether for new time series.
This should speed up registering new time series a bit.
* vmalert: remove head of line blocking for sending alerts
This change makes sending alerts to notifiers concurrent instead
of sequential. This eliminates head of line blocking, where first
faulty notifier address prevents the rest of notifiers from
receiving notifications.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmalert: make default timeout for sending alerts 10s
Previous value of 1m was too high and was inconsistent
with default timeout defined for notifiers via
configuration file.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmalert: linter checks fix
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmselect: limit `end` param max value by 2d in future
The change is applied only to service handlers like `/labels` or `/series`
and limits the `end` param by max value <= now() + 2 days. The same limit
is applied for the ingested data, so no reason to allow to request data
in future far than that.
The change is also needed for corner cases like https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2669
where too high `end` value triggers inefficient global index search.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* docs/CHANGELOG.md: document the bugfix
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
This allows filling the seriesCountByFocusLabelValue list in the /api/v1/status/tsdb response
with label values for the specified focusLabel, which contain the highest number of time series.
TODO: add this to Cardinality explorer at VMUI - https://docs.victoriametrics.com/#cardinality-explorer
Suggest `-search.maxStalenessInterval=10s` instead of `-search.maxStalenessInterval=1ms`,
since `1ms` would result in empty graphs in most cases, since the interval between data points
on the graph is usually much higher than 1ms. For example, if the graph shows time range of one hour
and it contains 1000 points, then the interval between points on the graph would equal to
3600s/1000=3.6 seconds.
* feat: make datepicker to be set to last 30 min by default
* fix: correct spinner while loading data
* feat: change legend style
* app/vmselect: `make vmui-update`
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
vmalert: support `limit` param in groups definition
`limit` param limits number of time series samples produced by a single rule
during execution.
On reaching the limit rule will return an err.
Signed-off-by: lihaowei <haoweili35@gmail.com>
The change updates histogram for registering SD update duration
only SD is considered as `active`. SD is active if at least
one scraper for this SD has started.
This change supposed to reduce metrics cardinality produced
by duration histogram which gets updated even if SD isn't configured.
See https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2671
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Workers count for merges affects the max part size during merges. Such behaviour
protects storage from running out of disk space for scenario when all workers
are merging parts with the max size.
This works very well for most cases. But for systems where high number of CPUs
is allocated for vmstorage components this could significantly impact the max
part size and result in more unmerged parts than expected.
While checking multiple production highly loaded setups it was discovered that
`max_over_time(vm_active_merges{type="storage/big}[1h]}"` rarely exceeds 2,
and `max_over_time(vm_active_merges{type="storage/small}[1h]}"` rarely exceeds 4.
The change in this commit limits the max value for concurrency accordingly.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
- Remove unused js bloatware from /targets page. This strips down binary size by more than 100Kb
- Add /service-discovery page for API compatibility with Prometheus
- Properly load bootstrap.min.css from /prometheus/targets
- Serve static contents for /targets page from app/vminsert instead of app/vmselect, because /targets page is served from there
- Take into account `ca`, `key` and `cert` values when generating string representation of TLSConfig.
Print hashes instead of real values because of security considerations.
- Properly update Config.tlsCertDigets when `key` and `cert` values are set.
This allows properly updating scrape targets after these values are updated in configs.
- Do not re-generate certificate from `key` and `cert` values per each call to getTLSCert,
because these values are immutable.
- Do not set `ca` value from `ca_file` value, so it isn't exposed at `/config` page.
- Generate proper error messages on incorrect `key`, `cert` or `ca` values.
Fixes security vulnerability in nth-check version <=1.0.2
My previous version pin was insufficient, as it was imported again through a different (svgo -> css-select).
- data modification
- more applications for gauges
- recommendations for instrumenting app with metrics
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* lib/promscrape/discovery/kubernetes: properly updates discovered scrape works
previously, added or updated scrapeworks may override previuosly
discovered.
it happens because swosByKey may contain small subset of kubernetes
objects with it's labels.
It happens for objectsUpdated and objectsAdded maps, which include only changed elements
* Properly calculate vm_promscrape_discovery_kubernetes_scrape_works
Co-authored-by: f41gh7 <nik@victoriametrics.com>
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
The new metric shows the configured evaluation interval per group.
Metric updates its value when group's interval is changed during
hot reload.
The new metric can be used to estimate how close group
is to start missing evaluation rounds. The following query
will show the % of used time by the group to evaluate all rules
before the next round:
```
(max(vmalert_iteration_duration_seconds{quantile="0.99"}) / vmalert_iteration_interval_seconds) * 100
```
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2618
Signed-off-by: hagen1778 <roman@victoriametrics.com>
This adds the ability to utilize sigv4 signing for all AWS services not
just "aps". When the newly introduced property "service" is not set it
will default to "aps".
Signed-off-by: Boris Petersen <boris.petersen@idealo.de>
docs: update docs for beginners
QuickStart page was updated with more relevant information.
Key Concepts was added to cover basics for the VictoriaMetrics.
Unexpectedly, Grafana makes an extra request to `/rules`
handler in addition to `/api/v1/rules` calls in alerts UI.
This happens only for Grafana versions older than 8.5.*.
Apparently, this is related to support of other monitoring
systems.
Prometheus responds with `text/html` content for UI page `/rules`
to such requests. Actually, returning just a blank page with
SC=200 works as well.
Returning actual response of `/api/v1/rules`
results in error in Grafana since it expects a `yaml` (?) in response.
So we add a placeholder to `vmalert`.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2583
Signed-off-by: hagen1778 <roman@victoriametrics.com>
For liquid text processor double braces `{{` `}}`
are special chars for templating.
Since we use them in some of our docs with different purpose,
we must escape them to avoid syntax errors from liquid.
For escaping curly braces we use bult-in plugin which helps
to enclose sections of text via `{% raw %}` and `{% endraw %}`.
This approach prevents liquid syntax errors and makes render correct.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Rules executor within group tracks series sent to remote write
in order to mark them as stale if they had disappeared in next
evaluation round.
The executor uses rules ID as a key to identifies series which belong to rule.
On config reload, executor remains active but the set of rules could change.
Hence, we need to properly cleanup the tracker for rules which has been disappeared
on config reload.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* deployment/docker: pass `-buildvs=false` to `go build` for production builds
This should resolve the `error obtaining VCS status: exit status 128` error
when the environment contains incorrect version of git or has incorrect access rights
to the directory with VictoriaMetrics source code.
See the following links for additional info:
- https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2508#issuecomment-1117126702 ,
- https://github.com/google/ko/issues/672
- https://github.com/golang/go/issues/49004
* lib/netutil: limit the number of concurrently established connections when calling ConnPool.Get()
This should reduce potential spikes in the number of established connections in the following cases:
- when the connection establishing procedure becomes temporarily slow
- after a temporary spike in the rate of ConnPool.Get() calls
See https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2552
* docs/CHANGELOG.md: document c8af625bcc
See https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1322#issuecomment-1120276146
* docs/Cluster-VictoriaMetrics.md: typo fix: `by by` -> `by`
* docs: add `resource usage limits` docs, which describe fine-grained tuning for various resource usage limits
* docs/Cluster-VictoriaMetrics.md: the `/api/v1/label/.../values` query can take CPU and ram at both vmstorage and vmselect
* Update root Readme and root vmagent readme
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
This should reduce potential spikes in the number of established connections in the following cases:
- when the connection establishing procedure becomes temporarily slow
- after a temporary spike in the rate of ConnPool.Get() calls
See https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2552
* vmalert: calculate time for firing alert based on the given timestamp
Previously, current time was used for checking the `firing` threshold.
This is not correct, since alerts are evaluated at specific timestamps.
Hence, this specific timestamp supposed to be used in the calculation.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmalert: properly calculate evaluation timestamp for rules
Timestamp for rules evaluation should be calculated after
the artifical delay for groups start. Otherwise, evaluation
timestamp can fall back too far in time.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Make it possible to migrate timeseries while restoring the
original timeseries name previously written from Prometheus
to InfluxDB v1 via remote_write.
Fixes: https://github.com/VictoriaMetrics/vmctl/issues/8
Previously ScrapeConfig.clone() was improperly copying promauth.Secret fields -
their contents was replaced with `<secret>` value.
This led to inability to use passwords and secrets in `-promscrape.config` file.
The bug has been introduced in v1.77.0 in the commit 67b10896d2
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2551
Do not assume the db label to be the last one and also
make sure we are not skipping it and everything afterwards.
Breaking the loop would cause following labels to be empty.
* {lib/promscrape,app/vmagent}: adds sigv4 support for vmagent remoteWrite
moves aws related code into separate lib from lib/promscrape
it allows to write data from vmagent to the AWS managed prometheus (cortex)
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1287
* Apply suggestions from code review
* wip
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
vmctl: fix vmctl blocking on process interrupt
This change prevents vmctl from indefinite blocking on
receiving the interrupt signal. The update touches all
import modes and suppose to improve tool reliability.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2491
This adds a metric for the rate limit.
The limit is present as a flag currently:
`flag{name="remoteWrite.rateLimit", value="500000", is_set="true"} 1`
We are running many instances of vmagent and when creating alerts it is harder than it needs to be when extracting the value from the flag.
With this change it should be easier to monitor how close to the limit we are.
`((100/vmagent_remotewrite_rate_limit{account="account"})*sum (rate(vmagent_remotewrite_conn_bytes_written_total{account="account"}))) and ON (account) flag{name="remoteWrite.rateLimit"} == 1`
Function `ValidateTemplates`, used on the vmalert startup,
is supposed to check whether used templates and functions
in loaded rules are correct. The function was parsing
and executing loaded templates.
However, rules may contain functions which can't be executed
without values (label values or query results), like `slice`.
Because of this, validation for completely valid expression
`{{ slice $labels.job 9 }}` will fail since `$labels.job`
is empty during validation.
This PR updates `ValidateTemplates` function to only parse
templates without executing them.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2514
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* lib/{storage,flagutil} - Add option for snapshot autoremoval
- add prometheus-like duration as command flag
- add option to delete stale snapshots
- update duration.go flag to re-use own code
* wip
* lib/flagutil: re-use Duration.Set() call in NewDuration
* wip
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
add progress bars to the VM importer
The new progress bars supposed to display the processing speed per each
VM importer worker. This info should help to identify if there is a bottleneck
on the VM side during the import process, without waiting for its finish.
The new progress bars can be disabled by passing `vm-disable-progress-bar` flag.
Plotting multiple progress bars requires using experimental progress bar pool
from github.com/cheggaaa/pb/v3. Switch to progress bar pool required changes
in all import modes.
The openTSDB mode wasn't changed due to its implementation, which implies individual progress
bars per each series. Because of this, using the pool wasn't possible.
Signed-off-by: dmitryk-dk <kozlovdmitriyy@gmail.com>
Co-authored-by: hagen1778 <roman@victoriametrics.com>
* Export "null" in jsonl instead of NaN
The NaN appeared because of staleness markers that were added for compatibility. I think it's better to use json `null`, implemented here.
Also maybe it also makes sense to add a flag like `?skip-staleness-markers=true` to `/export`, to skip nulls at all?
* Update app/vmselect/prometheus/export.qtpl
* app/vmselect/prometheus/export.qtpl.go: `make quicktemplate-gen`
* docs/CHANGELOG.md: document the change
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* app/vmselect: adds API /api/v1/status/buildinfo
it should fix an compability error with grafana 8.5 prometheus datasource
https://github.com/grafana/grafana/pull/46771
* Update main.go
* dashboards: remove index filter from stats panel for DiskUsage
The diskUsage stats panel was showing disk usage without including
size of the index, which is not correct. The filter was removed
to reflect the total disk usage.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2368
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* dashboards: add adhoc filter to dasbhoard variables
The adhoc filter allows to quickly apply global filters without
modifying the panels.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* dashboards: add new panel `IndexDB items rate`
The new panel supposed to reflect the pressure on indexDB
caused by churn rate or new series registration.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* dashboards: rm "Deferred merges" panel since it could be misleading
See more context here https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1682#issuecomment-938608067
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* dashboards: replace fixed interval of `5m` for `rate` expressions
Before we used fixed `5m` interval for expressions with `rate` func.
Unfortunately, this interval wasn't a fit for all the cases. So we
switch to `$__rate_interval` instead.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* dashboards: bump version requirement
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* dashboards: rm `vm_indexdb_items_added_size_bytes_total` expression
Rate over `vm_indexdb_items_added_size_bytes_total` doesn't seem to be useful
on the dasbhoard panel.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
This should prevent from `duplicate time series` errors when executing the following query:
kube_pod_container_resource_requests{resource="cpu"} * on (namespace,pod) group_left() (kube_pod_status_phase{phase=~"Pending|Running"}==1)
where `kube_pod_status_phase{phase=~"Pending|Running"}==1` filters out diplicate time series
replaces internStringMap with sync.Map - it greatly reduces lock contention
concurently reload scrape work for api watcher - each object labels added by dedicated CPU
changes can be tested with following script https://gist.github.com/f41gh7/6f8f8d8719786aff1f18a85c23aebf70
Reduce the number of memory allocations in this function. This improves its performance by up to 50%.
This should improve service discovery speed when big number of potential targets with big number of meta-labels
are generated by service discovery.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2270
This reduces the number of allocations and improves the performance for updating dropped targets' map.
This map is exposed at /api/v1/targets as in droppedTargets list.
* lib/promscrape: adds job restart method
it must restart only ScrapeConfig with changed content
this change greatly reduce time, that needed for job restart
and it should decrease possible data loss when config frequently changed at kubernetes based deployments
Apply suggestions from code review
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
* wip
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* lib/httpserver: added tlsCipherSuites flag
* lib/httpserver: compare lower case strings
* lib/httpserver: use EqualFold
* lib/httpserver: used flagutil.NewArray, supported only strings cipher suites
* lib/httpserver: updated flag description, added flag to documentation
* Update lib/httpserver/httpserver.go
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
This change fixes incorrect marshalling for ScrapeConfig
it affects http endpoint and ScrapeConfig checksum.
With omitempty, custom Marshaller is not called if field is not a pointer.
Previously this issue happened at vmalert
* lib/promscrape: allows to use k8s pod name as clusterMemberNum
it must improve user expirience and simplify clustering scrapers.
it must allow to use vmagent cluster with distroless images
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2359
* Apply suggestions from code review
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
previously, if native block cannot be unmarshaled, wg.Done wasn't called by unmarshal work.
It leads to connection blocking and possible dead-lock at client side
Before, relabeling for notifier configured via file was supported
only for target labels discovered via SD.
With this change, new config field `alert_relabel_configs` is introduced
for applying relabeling to labels of sent alerts.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
This metric is equivalent to `vm_available_memory_bytes`, but it has better name,
since the metric is related to a process, not VictoriaMetrics itself.
Leave `vm_available_memory_bytes` for backwards compatibility.
To improve compatibility with Prometheus alerting the order of
templates processing has changed.
Before, vmalert did all labels processing beforehand. It meant
all extra labels (such as `alertname`, `alertgroup` or rule labels)
were available in templating. All collisions were resolved in favour
of extra labels.
In Prometheus, only labels from the received metric are available in
templating, so no collisions are possible.
This change makes vmalert's behaviour similar to Prometheus.
For example, consider alerting rule which is triggered by time series
with `alertname` label. In vmalert, this label would be overriden
by alerting rule's name everywhere: for alert labels, for annotations, etc.
In Prometheus, it would be overriden for alert's labels only, but in annotations
the original label value would be available.
See more details here https://github.com/prometheus/compliance/issues/80
Signed-off-by: hagen1778 <roman@victoriametrics.com>
This should improve the performance for items sorting inside inmemoryBlock.MarshalUnsortedData
if they have common prefix.
While at it, improve the performance for inmemoryBlock.updateCommonPrefix for sorted items.
This should improve performance for inmemoryBlock.MarshalSortedData during background merge.
This reduces memory usage under production workloads by up to 10%,
while CPU spent on GC remains roughly the same.
The CPU spent on GC can be monitored with go_memstats_gc_cpu_fraction metric
The following actions are taken:
- Increase the TLS hashdshake timeout from 5 seconds to 10 seconds
- Increase dial timeout from 5 seconds to 30 seconds
- Specify DialContext instead of Dial in http.Transport. This allows properly handling
the Context arg during dialing the remote storage
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1699
* lib/protoparser: changes ParseStream for native format
uses reader instead of http.Request
updates app/vmagent and app/vmagent method usage
* app/vmctl: add verify-block subcommand
it allows to check exported from VictoriaMetrics data block in native format
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2362
Update app/vmctl/README.md
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
The new flag `datasource.disableKeepAlive` allows disabling keepalive
connections. This may be useful if there are multiple datasource
replicas (e.g. vmselects) behind the HTTP balancer to avoid uneven
load spread because of long-lived connections.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Executor recently gain field for storing previously sent series.
Since the same executor object can be used in multiple goroutines,
the access to this field should be serialized.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmalert: split alert's `Start` field into `ActiveAt` and `Start`
The `ActiveAt` field identifies when alert becomes active for rules
with `for > 0`. Previously, this value was stored in field `Start`.
The field `Start` now identifies the moment alert became `FIRING`.
The split is needed in order to distinguish these two moments
in the API responses for alerts.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmalert: support specific moment of time for rules evaluation
The Querier interface was extended to accept a new argument
used as a timestamp at which evaluation should be made.
It is needed to align rules execution time within the group.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmalert: mark disappeared series as stale
Series generated by alerting rules, which were sent to remote write
now will be marked as stale if they will disappear on the next
evaluation. This would make ALERTS and ALERTS_FOR_TIME series
more precise.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* wip
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmalert: evaluate rules at fixed timestamp
Before, time at which rules were evaluated was calculated
right before rule execution. The change makes sure
that timestamp is calculated only once per evalution round
and all rules are using the same timestamp.
It also updates the logic of resending of already resolved
alert notification.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmalert: allow overridin `alertname` label value if it is present in response
Previously, `alertname` was always equal to the Alerting Rule name. Now,
its value can be overriden if series in response containt the different value
for this label.
The change is needed for improving compatibility with Prometheus.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmalert: align rules evaluation in time
Now, evaluation timestamp for rules evaluates as if
there was no delay in rules evaluation. It means, that
rules will be evaluated at fixed timestamps+group_interval.
This way provides more consistent evaluation results and
improves compatibility with Prometheus,
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmalert: add metric for missed iterations
New metric `vmalert_iteration_missed_total` will show
whether rules evaluation round was missed.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmalert: reduce delay before the initial rule evaluation in group
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmalert: rollback alertname override
According to the spec:
```
The alert name from the alerting rule (HighRequestLatency from the example above) MUST be added to the labels of the alert with the label name as alertname. It MUST override any existing alertname label.
```
https://github.com/prometheus/compliance/blob/main/alert_generator/specification.md#step-3
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmalert: throw err immediately on dedup detection
```
The execution of an alerting rule MUST error out immediately and MUST NOT send any alerts
or add samples to samples receiver if there is more than one alert with the same labels
```
https://github.com/prometheus/compliance/blob/main/alert_generator/specification.md#step-4
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmalert: cleanup
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmalert: use strings builder to reduce allocs
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* lib/{storage,regexpcache}: replaces regexpCacheMap with LRU cache
It should decrease memory usage for regexp caching
with storing cacheEntry by pointer - golang map should be able to effectivly shrink it's size
original issue with this case - unexpected map grows and storage OOM
Apply suggestions from code review
Co-authored-by: Roman Khavronenko <roman@victoriametrics.com>
Adds missing metrics for regexp cache and regexpPrefixes cache
* wip
* wip
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* Update url-examples.md
+additional examples
* Update
* Update url-examples
Some changes requested by Roman.
* Update url-examples.md
* Update url-example
* Update url-examples
Additional info and marking for /labels part
* Update url-example
Added example with complex query which needs encoding:
How to execute the query similar to - sum(increase(foo{status="bar"}[5m])) by (status)
* Update url-samples
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* fix for issue 2255 - matchTagFilters for positive empty-match filters
* add example to comments
* formatting
* add test for positive empty match
* formatting
* vmalert: add support for `sortByLabel` template function
* vmalert: update API according to Prometheus conformance program
The changes to the API, field names and URL path has been made
according to the Prometheus specification for `alert_generator`
https://github.com/prometheus/compliance/blob/main/alert_generator/specification.md
* vmalert: fix the timestamp of the evaluated rules
The timestamp used for alert's `EndsAt` was calculated
before sending the notification. While the correct way
is to use the timestamp taken right before rules evaluation.
* vmalert: add `-datasource.queryTimeAlignment` flag
The flag is supposed to provide ability to disable `time`
param alignment when executing rules. By default, this flag
is enabled, so it remains backward compatible.
The flag was introduced to achieve better compatibility
with Prometheus behaviour according to https://github.com/prometheus/compliance/blob/main/alert_generator/specification.md
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Some of the releases could negatively affect performance for a limited
period of time due to some changes in core. Update details are meant to
warn users about expected changes in peformance after the update.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
The lifetime of storageBlock is much shorter comparing to the lifetime of inmemoryPart,
so sync.Pool usage should reduce overall memory usage and improve performance
because of better locality of reference when marshaling inmemoryBlock to inmemoryPart.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2247
There is no need to sort the underlying data according to sorted items there.
This should reduce cpu usage when registering new time series in `indexdb`.
Thanks to @ahfuzhang for the suggestion at https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2245
* fixes jwt token parse with correct base64Url decoding
it must be applied according to jwt RFC that requires token to be URL safe
added slow path for decoding tokens with std base64 decoding
adds error logging for vmgateway
* docs/CHANGELOG.md: document the bugfix
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* lib/discovery/consul: update services on the watcher's start
Previously, watcher's start was only initing goroutines for discovery
but not waiting for the first iteration to end. It means first Consul
discovery wasn't returning discovered targets until the next iteration.
The change makes the watcher's start blocking until we get first discovery
iteration done and all registries updated.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmalert: remove workarounds for consul SD
Now when consul SD lib properly updates services
on the first start, we don't need workarounds in vmalert.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* lib/discovery/consul: update after review
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* wip
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* added missed runbook for udpating k8s VM Cluster in DO
* Update deployment/marketplace/digitialocean/one-click-droplet/RELEASE_GUIDE.md
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
Lowering threshold from 50% to 5% will be more sufficient
for discovering un-healthy system state. It also goes in
sync with alert definition in cluster branch.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
The new row Caches adds more visibility for cache utilization by VM.
It replaces the old `Cache size` panel.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* dashboards: plot cpu limits for vmagent, vmalert and vm-single dashboards
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* alerts: add `TooHighCPUUsage` alert for all VM components
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* dashboards: bump components version requirements
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* Remove unneeded dependency on `numeral` package
* Properly parse numbers obtained from /api/v1/query_range according to
https://prometheus.io/docs/prometheus/latest/querying/api/#expression-query-result-formats
* Optimize updating processing the received data from /api/v1/query_range
* Make smoother zoom on `ctrl+scroll`
* Reduce the number of points received from /api/v1/query_range by 2x in order to reduce load on backend
- Postpone the pre-poulation to the last hour of the current day. This should reduce the number
of useless entries in the next per-day index, which shouldn't be created there,
when the corresponding time series are stopped to be pushed during the current day.
- Make the pre-population more smooth in time by using the hash of MetricID instead of MetricID itself
when calculating the need for for the given MetricID pre-population.
- Sync the logic for pre-population of the next day inverted index with the logic of pre-populating tsid cache
after indexdb rotation. This should improve code maintainability.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/430
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1401
* lib/index: reduce read/write load after indexDB rotation
IndexDB in VM is responsible for storing TSID - ID's used for identifying
time series. The index is stored on disk and used by both ingestion and read path.
IndexDB is stored separately to data parts and is global for all stored data.
It can't be deleted partially as VM deletes data parts. Instead, indexDB is
rotated once in `retention` interval.
The rotation procedure means that `current` indexDB becomes `previous`,
and new freshly created indexDB struct becomes `current`. So in any time,
VM holds indexDB for current and previous retention periods.
When time series is ingested or queried, VM checks if its TSID is present
in `current` indexDB. If it is missing, it checks the `previous` indexDB.
If TSID was found, it gets copied to the `current` indexDB. In this way
`current` indexDB stores only series which were active during the retention
period.
To improve indexDB lookups, VM uses a cache layer called `tsidCache`. Both
write and read path consult `tsidCache` and on miss the relad lookup happens.
When rotation happens, VM resets the `tsidCache`. This is needed for ingestion
path to trigger `current` indexDB re-population. Since index re-population
requires additional resources, every index rotation event may cause some extra
load on CPU and disk. While it may be unnoticeable for most of the cases,
for systems with very high number of unique series each rotation may lead
to performance degradation for some period of time.
This PR makes an attempt to smooth out resource usage after the rotation.
The changes are following:
1. `tsidCache` is no longer reset after the rotation;
2. Instead, each entry in `tsidCache` gains a notion of indexDB to which
they belong;
3. On ingestion path after the rotation we check if requested TSID was
found in `tsidCache`. Then we have 3 branches:
3.1 Fast path. It was found, and belongs to the `current` indexDB. Return TSID.
3.2 Slow path. It wasn't found, so we generate it from scratch,
add to `current` indexDB, add it to `tsidCache`.
3.3 Smooth path. It was found but does not belong to the `current` indexDB.
In this case, we add it to the `current` indexDB with some probability.
The probability is based on time passed since the last rotation with some threshold.
The more time has passed since rotation the higher is chance to re-populate `current` indexDB.
The default re-population interval in this PR is set to `1h`, during which entries from
`previous` index supposed to slowly re-populate `current` index.
The new metric `vm_timeseries_repopulated_total` was added to identify how many TSIDs
were moved from `previous` indexDB to the `current` indexDB. This metric supposed to
grow only during the first `1h` after the last rotation.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1401
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* wip
* wip
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* lib/promscrape: support prometheus-like duration in scrape configs
The change allows to specify duration values like `1d`, `1w`
for fields `scrape_interval`, `scrape_timeout`, etc.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/817#issuecomment-1033384766
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* lib/blockcache: make linter happy
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* lib/promscrape: support prometheus-like duration in scrape configs
* add support for extra fields `scrape_align_interval` and `scrape_offset`;
* support Prometheus duration parsing for `__scrape_interval__`
and `__scrape_duration__` labels;
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* wip
* wip
* docs/CHANGELOG.md: document the feature
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* adds CGO build for arm64
it must improve performance for arm64 based deployments of vmstorage and
vmsingle for 15-20%
it depends on gozstd package update for correct musl gozstd vendoring
* typo fixes
* docs/CHANGELOG.md: document the change
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
This should improve data ingestion speed if time series samples are ingested with interval bigger than 2 minutes.
The actual interval could exceed 2 minutes if the original interval between samples doesn't exceed 2 minutes
in the case of slow inserts. Slow inserts may appear in the following cases:
* Big number of new time series are pushed to VictoriaMetrics, so they couldn't be registered in 2 minutes.
* MetricName->tsid cache reset on indexdb rotation or due to unclean shutdown.
In this case VictoriaMetrics needs to load MetricName->tsid entries for all the incoming series from IndexDB.
IndexDB uses the block cache for increasing lookup performance. If the cache has no the needed block,
then IndexDB reads and unpacks the block from disk. This requires an extra disk read IO and CPU.
See https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1401
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2007
This also should increase performance for periodically executed queries with intervals from 2 minutes to 5 minutes.
See the previous similar commit - 43103be011
It is possible that the timeout can be increased further. Let's collect production numbers for this change
so the timeout could be adjusted further.
Previously limits for new caches were taken from cache stats.
These limits could mismatch the original limits. This could result in failed cache load
if the stored cache has been created with the limits obtained from cache stats.
* Mention about the ability to configure vmalert notifiers via files in docs/CHANGELOG.md
* Mention about the ability to use Consul service discovery for vmalert notifiers in docs/CHANGELOG.md
* Run `make docs-sync` in order to sync app/vmalert/README.md to docs/vmalert.md
vmalert: support configuration file for notifiers
* vmalert notifiers now can be configured via file
see https://docs.victoriametrics.com/vmalert.html#notifier-configuration-file
* add support of Consul service discovery for notifiers config
see https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1947
* add UI section for currently loaded/discovered notifiers
* deprecate `-rule.configCheckInterval` in favour of `-configCheckInterval`
* add ability to suppress logs for duplicated targets for notifiers discovery
* change behaviour of `vmalert_alerts_send_errors_total` - it now accounts
for failed alerts, not HTTP calls.
This metric shows the number of CPU cores available to the process.
This allows creating alerting rules on CPU saturation with the following query:
rate(process_cpu_seconds_total[5m]) / process_cpu_cores_available > 0.9
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2107
* optimized code ,because only the first error,so no need var errors []error
* optimized code ,because only the first error,so no need var errors []error
Co-authored-by: lirenzuo <lirenzuo@shein.com>
* FAQ update: how downsampling and deduplication will work at the same time
* Update docs/FAQ.md
Co-authored-by: Aliaksandr Valialkin <valyala@gmail.com>
This adds more optimization cases for https://utcc.utoronto.ca/~cks/space/blog/sysadmin/PrometheusLabelNonOptimization
For example:
* Multi-level transform functions. For example, abs(round(foo{a="b"})) + bar{x="y"}
is now optimized to abs(round(foo{a="b",x="y"})) + bar{a="b",x="y"}
* Binary operations with `on()`, `without()`, `group_left()` and `group_right()` modifiers.
For example, foo{a="b"} on (a) + bar is now optimized to foo{a="b"} on (a) + bar{a="b"}
* Multi-level binary operations. For example, foo{a="b"} + bar{x="y"} + baz{z="q"}
is now optimized to foo{a="b",x="y",z="q"} + bar{a="b",x="y",z="q"} + baz{a="b",x="y",z="q"}
* Aggregate functions. For example, sum(foo{a="b"}) by (c) + bar{c="d"}
is now optimized to sum(foo{a="b",c="d"}) by (c) + bar{c="d"}
Previously bytesutil.Resize() was copying the original byte slice contents to a newly allocated slice.
This wasted CPU cycles and memory bandwidth in some places, where the original slice contents wasn't needed
after slize resizing. Switch such places to bytesutil.ResizeNoCopy().
Rename the original bytesutil.Resize() function to bytesutil.ResizeWithCopy() for the sake of improved readability.
Additionally, allocate new slice with `make()` instead of `append()`. This guarantees that the capacity of the allocated slice
exactly matches the requested size. The `append()` could return a slice with bigger capacity as an optimization for further `append()` calls.
This could result in excess memory usage when the returned byte slice was cached (for instance, in lib/blockcache).
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2007
- Optimize Cache.RemoveBlocksFromPart(), so it doesn't need to iterate over all the cached blocks.
- Cache blocks if there were no cache misses during the last 2 minutes.
This may be the case when new blocks are added simultaneously to the storage and to the cache.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2007
* fix: add date validate for time range
* app/vmselect/vmui: `make vmui-update`
* docs/CHANGELOG.md: document the bugfix
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
Previously these caches could exceed limits set via `-memory.allowedPercent` and/or `-memory.allowedBytes`,
since limits were set independently per each data part. If the number of data parts was big, then limits could be exceeded,
which could result to out of memory errors.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2007
* feat: replace @codemirror to text field
* feat: switch to Preact from React
* fix: optimize mui imports
* fix: remove unused vars
* update package-lock.json
* app/vmselect/vmui: `make vmui-update`
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
- Document the bugfix at docs/CHANGELOG.md
- Set __address__ field after copying commonLabels to the resulting map of discovered labels.
This makes sure that the correct __address__ label is used.
This reverts commit 2104330d4c.
This check doesn't work well for community pull requests, since third-party users
aren't motivated to rebase pull requests to branch head after they are created.
This check is useful for private repositories though.
* Simplify queries to OpenTSDB (and make them properly appear in OpenTSDB query stats) and also tweak defaults a bit
* Convert seconds to milliseconds before writing to VictoriaMetrics and increase subquery size
Signed-off-by: John Seekins <jseekins@datto.com>
On import process interruption `vmctl` now prints the max and min timestamps of:
* last failed batch if import ended with error;
* last sent batch if import was cancelled by user.
To get more details for each timeseries in batch user needs to specify `--verbose` flag.
The change does not relate to `vm-native` mode, since `vmctl` has no control over
transferred data in this mode.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1236
Signed-off-by: hagen1778 <roman@victoriametrics.com>
It will prevent merging in a branch that's not based on its base branch HEAD, leading to streamlined history.
Note it will not prevent squash commits, nor commits directly to base branch.
* feat: add a reset query by clicking the logo
* feat: add sequence number for query fields
* feat: invert behavior on the graph's legend
* app/vmselect/vmui: `make vmui-update`
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* vmalert: check if remoteWrite is configured for replay mode
The purpose of `replay` mode is to backfill results of recording
or alerting rules. So `remoteWrite.url` should be required.
Otherwise, process can fail on attempt to send data.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* Update app/vmalert/main.go
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* vmagent: add error log for skipped data block when rejected by receiving side
Previously, rejected data blocks were silently dropped - only metrics were update.
From operational perspective, having an additional logging for such cases is preferable.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1911
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmagent: throttle log messages about skipped blocks
The new type of logger was added to logger pacakge.
This new type supposed to control number of logged messages
by time.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* lib/logger: make LogThrottler public, so its methods can be inspected by external packages
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
This should handle the case when the original job_name has been changed in -promscrape.config ,
while the resulting job label remains the same because it is overriden via relabeling.
* fix: remove disabling custom step when zooming
* feat: add a dynamic calc of the width of the graph
* fix: add validate y-axis limits
* fix: correct axis limits for value 0
* fix: change logic create time series
* fix: change types for tooltip
* fix: correct points on the line
* fix: change the logic for set graph width
* fix: stop checking the period when auto-refresh is enabled
* app/vmselect/vmui: `make vmui-update`
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* dashboards/vmsingle: add "Merges deferred" panel
The new panel supposed to show if there were deferred merges
due to insufficient disk space.
It goes within alerting rule which suppose to send a signal
in such cases.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* dashboards/vmsingle: add "Cache usage" panel
The new panel supposed to show the % of the used cache
compared to allowed size by type.
It should help to determine underutilized types of caches.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* dashboards/vmsingle: bump version requirement
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* dashboards/vmsingle: rm alert for `vm_merge_need_free_disk_space`
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* Update FAQ.md
Adding explanation "Why do same metrics have differences in VictoriaMetrics and Prometheus dashboards?"
* Update FAQ
* Update FAQ.md
* Apply suggestions from code review
Co-authored-by: Roman Khavronenko <hagen1778@gmail.com>
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
Co-authored-by: Roman Khavronenko <hagen1778@gmail.com>
* Document changes_prometheus(), increase_prometheus() and delta_prometheus() functions.
* Simplify their implementation
* Mention these functions in docs/CHANGELOG.md
* dashboards/vmagent: shuffle panels for better visibility
More important error/dropped panels were moved higher on the main row.
Network usage panel moved to Resource usage row.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* dashboards/vmagent: add Troubleshooting row to show top 5 instances/jobs by churn rate
New panels are supposed to show top 5 jobs or targets which generate the most
of the churn rate. They were placed into a new row "Troubleshooting".
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* dashboards/vmagent: add panels for showing persistent queue saturation
New panels were added to Torubleshooting row to show the persistent queue
saturation. The corresponding alerts were added and linked to these
panels as well.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* dashboards/vmagent: add alert "RejectedRemoteWriteDataBlocksAreDropped"
New alert suppose to send a notification when vmagent starts to drop
data blocks rejected by configured remote write destiantion.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* Simplify queries to OpenTSDB (and make them properly appear in OpenTSDB query stats) and also tweak defaults a bit
Signed-off-by: John Seekins <jseekins@datto.com>
When using `vmalert` with older Prometheus versions, the passed
`step=2m` may be parsed by Prometheus with an err: "cannot parse \"2m0s\" to a valid duration".
In order to improve compatibility vmalert will always convert step duration to seconds.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1943
Signed-off-by: hagen1778 <roman@victoriametrics.com>
For example, `{__graphite__=~"foo.(bar|baz)"}` is automatically converted to `{__graphite__=~"foo.{bar,baz}"}` before execution.
This allows using multi-value Grafana template variables such as `{__graphite__=~"foo.($app)"}`.
* feat: add a label for the Query field
* fix: change zoom position
* fix: add description and error code to alerts
* fix: correct logic query history
* fix: correct update query history
* feat: add custom step
* update package-lock.json
* docs: document that VMUI now supports overriding of `step` query arg, which is passed to `/api/v1/query_range`
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
Service labels like `alertname` or `alertgroup` were attached
after template expanding for `labels` section. Because of this,
labels `alertname` or `alertgroup` weren't available for templating
in `labels` section of alert's definition.
This commit changes the order of labels attaching and adds a test
for verifying these labels availability.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1921
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* feat: change duration by "enter"
* fix: optimize data processing for chart
* feat: set minimum step to 1ms
* update dependencies
* feat: remove save the last query to local storage
* fix: handle an error in a table with subqueries
* feat: store display type in URL
* Revert "feat: store display type in URL"
This reverts commit ccc242c69a.
* feat: store display type in URL
* refactor: move the time setting to a folder
* refactor: move the query configurator to a folder
* refactor: move the auth settings to a folder
* feat: improve styles
* feat: add multi query
* update package-lock
* feat: add display multiple queries
* feat: add limits for multiple queries
* update dependencies
* feat: add history for multiple queries
* feat: add line type to legend
* feat: change style for switch
* feat: change the logic for axes limits for multiple queries
* update package-lock.json
* update dependencies
* feat: add the filter to legend
* wip
* lib/httpserver: add missing 127.0.0.1 hostname to the logged address for http and pprof server if the address starts with ':'
This allows copy-pasting the url to http server from logs.
* lib/httpserver: add missing 127.0.0.1 hostname to the logged address for http and pprof server if the address starts with ':'
This allows copy-pasting the url to http server from logs.
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* Document the ability to specify http or https urls in `-auth.config` at docs/CHANGELOG.md
* Move the ReadFileOrHTTP to lib/fs, so it can be re-used in other places where a file
should be read from the given path. For example, in `-promscrape.config` at `vmagent`.
* add support for reading remote auth_config file via http
* fix lint
* fix defer on close body
Co-authored-by: Tiago Magalhães <tmagalhaes@wavecom.pt>
* vmalert: introduce additional HTTP URL params per-group configuration
The new group field `params` allows to configure custom HTTP URL params
per each group. These params will be applied to every request before
executing rule's expression. Hot config reload is also supported.
Field `extra_filter_labels` was deprecated in favour of `params` field.
vmalert will print deprecation log message if config file contains
the deprecated field.
`params` fields are supported by both Prometheus and Graphite datasource types.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmalert: provide more examples for `params` field
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmalert: set higher priority for `params` setting
If there would be a conflict between URL params set in `datasource.url` flag
and params in group definition the latter will have higher priority.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
The bump was required for `vmalert` package.
`vmalert` docs now also contain an updated description.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
The vm_cache_size_max_bytes metric can be used for determining caches which reach their capacity via the following query:
vm_cache_size_bytes / vm_cache_size_max_bytes > 0.9
* removes FileSize from backup part key
it should fix download restoration for backups
* Update lib/backup/common/part.go
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
The exported data isn't de-duplicated by default due to performance reasons.
It is expected that the de-duplication is applied during importing the exported data.
The deduplication is applied only when exporting data via /api/v1/export if `reduce_mem_usage=1` query arg isn't passed to the request.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1837
Previously, vmalert would print an err message and set vmalert_config_last_reload_successful=0
only once during a hot reload of a bad config. Such behaviour may result into non noticed
event of a bad config reload attempt
Now, it continues to print error messages and keep vmalert_config_last_reload_successful state
until successful attempt will be made or config state will be rolled back to prev state.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
For a long time notifier.Addr flag was required. The assumption was that vmalert will
be always used for alerting. However, practice shows that some users need only
recording rules. In this case, requirement of notifier.Addr is ambigious.
The change verifies if loaded config contains recording or alerting rules and
if there are corresponding flags set. This is true for initial config load
and hot reload.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* fix: handle an error in a table with subqueries
* feat: store display type in URL
* Revert "feat: store display type in URL"
This reverts commit ccc242c69a.
* Simplify queries to OpenTSDB (and make them properly appear in OpenTSDB query stats) and also tweak defaults a bit
Signed-off-by: John Seekins <jseekins@datto.com>
* remove extraneous printlns
Signed-off-by: John Seekins <jseekins@datto.com>
* remove empty line
Signed-off-by: John Seekins <jseekins@datto.com>
* fix bug in offset calcuation and closer to working with simpler queries
Signed-off-by: John Seekins <jseekins@datto.com>
* fix boolean eval
Signed-off-by: John Seekins <jseekins@datto.com>
* fix casting and check for multiple series
Signed-off-by: John Seekins <jseekins@datto.com>
Describe remoteWrite.url is used to persist rules and alerts state info,
and add an additional paragraph explaining the separation between
-remoteRead.url and -datasource.url.
Fixes#1810.
* feat: add query history
* fix: change detect keyUp for nav query history
* feat: set default query history
* feat: change graph legend
* update dependencies
* update codemirror version
* fix: correct update period time after zoom/pan
* fix: optimize data processing for the graph
* fix: eliminate memory leaks related to mouse events
* fix: correct display of straight line
* Merge branch 'master' into vmui-fix-reset-graph
* app/vmselect: `make vmui-update`
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
This removes the unneeded level of indirection and improves code readability.
The "prometheus" and "graphite" constants aren't going to change in the future, so there is no sense in hiding them behind constants.
This should improve the maximum data ingestion speed for highly-loaded vmagent instances
which run on beefy servers with many CPU cores and big amounts of RAM
Previously only the lower part of 64-bit hash was used for calculating the offset.
This may give uneven distribution in some cases. So let's use all the available 64 bits from the hash
for calculating the offset.
Prometheus allows to have groups with no rules, so we should support
it in vmalert as well for compatibility reasons.
It is also allowed to hot-reload empty groups by adding or removing rules.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Do not store in memory the response from the last scrape per each target if -promscrape.noStaleMarkers option is enabled.
This should reduce memory usage when the scraped targets return large responses.
Previously, ID for alert entity was generated without alertname or groupname.
This led to collision, when multiple alerting rules within the same group
producing same labelsets. E.g. expr: `sum(metric1) by (job) > 0` and
expr: `sum(metric2) by (job) > 0` could result into same labelset `job: "job"`.
The issue affects only UI and Web API parts of vmalert, because alert ID is used
only for displaying and finding active alerts. It does not affect state restore
procedure, since this label was added right before pushing to remote storage.
The change now adds all extra labels right after receiving response from the datasource.
And removes adding extra labels before pushing to remote storage.
Additionally, change introduces a new flag `Restored` which will be displayed in UI
for alerts which have been restored from remote storage on restart.
* adds tab as second separator for graphite text protocol
* changes indexFunc for indexAny
* Update lib/protoparser/graphite/parser_test.go
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
* feat: add query history
* fix: change detect keyUp for nav query history
* feat: set default query history
* app/vmselect/vmui: `make vmui-update`
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
This should make visible the set flags at flag.Visit(), which is used later for logging
and exporting the `is_set` label for these flags at /metrics page
Commit fixes potential race condition when group update
and generating of ID() happens simultaneously.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Regression was introduced during code refactoring. It potentially
could lead to situation when SIGHUP signals were ignored while
vmalert was still busy with initing group manager.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
The following errors:
vendor/cloud.google.com/go/storage/storage.go:1447:53: o.GetCustomerEncryption().GetKeySha256 undefined (type *"google.golang.org/genproto/googleapis/storage/v2".Object_CustomerEncryption has no field or method GetKeySha256)
vendor/cloud.google.com/go/storage/writer.go:439:10: q.GetCommittedSize undefined (type *"google.golang.org/genproto/googleapis/storage/v2".QueryWriteStatusResponse has no field or method GetCommittedSize)
The extra `/` may cause issues when additional path prefixes
are configured. Also, removing it makes it consistent
with the rest of declarations.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
vmctl: properly convert influx bools into integer representation
When using vmctl influx, the import would fail importing boolean fields
with:
```
failed to convert value "some".0 to float64: unexpected value type true
```
This converts `true` to `1` and `false` to `0`.
Fixes#1709
Sort series by a hash calculated from the series labels. This should guarantee "random" selection of the returned time series.
Previously the selection could be biased, since time series were sorted alphabetically by label names and label values.
Stream parsing mode can be automatically enabled when scraping targets with big response bodies
exceeding the -promscrape.minResponseSizeForStreamParse , so it must be always initialized.
This allows sending staleness marks and properly calculate scrape_series_added metric in stream parsing mode
at the cost of the increased memory usage, since now the potentially big response is kept
in the lastScrape byte slice per each scrapeWork.
In practice the memory usage increase shouldn't be big, since the response size
is usually much smaller than the parsed metrics from this response after the relabeling,
which usually adds a big pile of target-specific labels per each metric.
* vmalert: adjust `http.Transport.MaxIdleConns` value accordingly to `http.Transport.MaxIdleConnsPerHost`
`http.Transport.MaxIdleConnsPerHost` setting is controlled by `datasource.maxIdleConnections` flag,
while `http.Transport.MaxIdleConns` is inherited from DefaultTransport and is equal to `100`.
The fix adjusts `http.Transport.MaxIdleConns` value if it is lower than `http.Transport.MaxIdleConnsPerHost`.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* vmauth: adjust `http.Transport.MaxIdleConns` value accordingly to `http.Transport.MaxIdleConnsPerHost`
`http.Transport.MaxIdleConnsPerHost` setting is controlled by `maxIdleConnsPerBackend` flag,
while `http.Transport.MaxIdleConns` is inherited from DefaultTransport and is equal to `100`.
The fix adjusts `http.Transport.MaxIdleConns` value if it is lower than `http.Transport.MaxIdleConnsPerHost`.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
The source link is controlled by `external.url` and `external.alert.source`
flags, in the same way as for alertmanager notifications.
The source link is added to Alerts list view, and specific Alert view.
* added guide for VM operator
* Update docs/guides/getting-started-with-vm-operator.md
Co-authored-by: Roman Khavronenko <hagen1778@gmail.com>
* Update docs/guides/getting-started-with-vm-operator.md
Co-authored-by: Roman Khavronenko <hagen1778@gmail.com>
* Fixed different typos and added improvements from proposals
* move remoteWrite.url to other place
* fixed typo
* rephrased vminsert explanation
* remove not needed parameters for default setup
Co-authored-by: Roman Khavronenko <hagen1778@gmail.com>
* adds read-only mode for vmstorage
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/269
* changes order a bit
* moves isFreeDiskLimitReached var to storage struct
renames functions to be consistent
change protoparser api - with optional storage limit check for given openned storage
* renames freeSpaceLimit to ReadOnly
The list of functions, which can adjust lookbehind window is more limited than the rest of functions,
so it is better from maintainability and readability PoV using the allowlist instead of blocklist.
It is expected that the `deriv(m[d])` returns non-empty value if the lookbehind window `d`
contains less than 2 samples in the same way as `rate()` does.
This is a follow-up after 3e084be06b .
Previously, `predict_linear` returned slightly different results comparing
to Prometheus. The change makes linear regression algorithm compatible
with Prometheus.
`deriv` was excluded from the list of functions which can adjust the time
window for the same reasons.
The fix makes the binary comparison func to check for NaNs
before executing the actual comparison. This prevents VM
to return values for non-existing samples for expressions
which contain bool comparisons. Please see added test
for example.
It appeared, that `testRowsEqual` NaN comparison was incorrect.
The fix caused some tests to fail. Please see the change and
tests updated.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
Adjustment results into discrepancy between Prometheus and VM on time windows
smaller than scrape interval.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
The fix will always return zero if received set of items consists of one
element only, which also means no deviation.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
The change affects `count/stddev/stdvar_over_time` funcs and makes
them to return NaN instead of zero when there is no datapoints
in a time window.
This is needed for improving compatibility with Prometheus.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
The removed fast path optimisations weren't consistent with
`quantile` function behavior and results into discrepancy.
Specifically, results didn't match in cases when:
* 0 < phi > 1;
* values contain only one element.
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* app/vmselect: `quantile` func compatiblity with Prometheus
The `quantile` func was previously calculated by https://github.com/valyala/histogram
package. The result of such calculation was always the closest real value to
requested quantile. While in Prometheus implementation interpolation is used.
Such difference may result into discrepancy in output between Prometheus and
VictoriaMetrics.
This commit adds a Prometheus-like `quantile` function. It also used by other
functions which depend on it, such as `quantiles`, `quantile_over_time`, `median` etc.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1625
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* app/vmselect: `quantile` review fixes
* quantile functions were split into multiple to provide
different API for already sorted data;
* float64sPool is used for reducing allocations. Items in pool may have
different sizes, but defining a new pool was complicates due to name collisions;
Signed-off-by: hagen1778 <roman@victoriametrics.com>
* app/vmselect: make sorting for query result similar to Prometheus
Updated sorting allows to get the order of series in result similar or equal
to what Prometheus returns.
The change is needed for compatibility reasons.
* Update app/vmselect/promql/exec_test.go
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
`omitempty` tag resulted into skipping this param on marshaling,
which was used as a checksum for groups configuration. Since on
config reload checksums are compared before applying changes,
any change to `interval` only didn't trigger config reload.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1641
Signed-off-by: hagen1778 <roman@victoriametrics.com>
The commit contains the following changes:
- Show vmui when requesting /graph page in order to be compatible with Prometheus datasource in Grafana.
- Properly encode query args at vmui url.
- Set the number of points on the graph to the number of horizontal pixels divided by 2. Previously it was hardcoded to 30.
- Do not save server url to persistent storage at browser, since it should be always obtained from the url.
- Run `make vmui-update` for updating vmui embedded into VictoriaMetrics.
* feat: change url params for compatible prometheus
* style: add comment for TimeParams
* fix: change get default server for single version
* fix: change function for get query string value
* vmalert: add flag to limit the max value for auto-resovle duration for alerts
The new flag `rule.maxResolveDuration` suppose to limit max value for
alert.End param, which is used by notifiers like Alertmanager for alerts auto resolve.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1586
Also reduce CPU usage when applying `series_limit` to scrape targets with constant set of metrics.
The main idea is to perform the calculations on scrape_series_added and series_limit
only if the set of metrics exposed by the target has been changed.
Scrape targets rarely change the set of exposed metrics,
so this optimization should reduce CPU usage in general case.
These actions simlify metrics filtering. For example,
- action: keep_metrics
regex: 'foo|bar|baz'
would leave only metrics with `foo`, `bar` and `baz` names, while the rest of metrics will be deleted.
The commit also makes possible to split long regexps into multiple lines. For example, the following config is equivalent to the config above:
- action: keep_metrics
regex:
- foo
- bar
- baz
New UI pages:
/ - welcome page with API handlers list;
/groups - list of all rules per group;
/alerts - list of all active alerts;
/groupID/alertID/status - status of the active alert;
The number of series per target can be limited with the following options:
* Global limit with `-promscrape.maxSeriesPerTarget` command-line option.
* Per-target limit with `max_series: N` option in `scrape_config` section.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1561
* dasbhoard: replace `null` datasources
null datasource value may confuse Grafana and make it drop panel query in some
versions.
* docker: bump grafana image version
* dashboards: add URL variable selector to vmagent dashboard
* dashboards: add new panel `Remote write connection saturation` to vmagent dashboard
* alerts: add new alert for `Remote write connection saturation` panel of vmagent dashboard
* dashboards: add "Logging rate" panel to vmagent dashboard
The purpose of update is to make README and flags description more
clear to the reader. Especially, show that vm-account-id flag is required
for clustered version of VM.
* vmalert: allow extra GET params in datasource package
ExtraParams will be added as GET params to every HTTP request made by datasource.
The `roundDigits` param, for example, was substituted by corresponding extra param.
* vmalert: add nocache=1 param for replay process
The `nocache=1` param is VictoriaMetrics specific parameter which prevents it
from caching and boundaries aligning for queries. We set it to avoid cache
pollution in `replay` mode and also to avoid unnecessary time range boundaries
alignment.
* vmalert: mention nocache=1 in replay description
* vmalert: fix bug with unused param
* vmalert: remove `vmalert_execution_duration_seconds` metric
The summary for `vmalert_execution_duration_seconds` metric gives no additional
value comparing to `vmalert_iteration_duration_seconds` metric.
* vmalert: update config reload success metric properly
Previously, if there was unsuccessfull attempt to reload config and then
rollback to previous version - the metric remained set to 0.
* vmalert: add Grafana dashboard to overview application metrics
* docker: include vmalert target into list for scraping
* vmalert: extend notifier metrics with addr label
The change adds an `addr` label to metrics for alerts_sent and alerts_send_errors
to identify which exact address is having issues.
The according change was made to vmalert dashboard.
* vmalert: update documentation and docker environment for vmalert's dashboard
Mention Grafana's dashboard in vmalert's README in a new section #Monitoring.
Update docker-compose env to automatically add vmalert's dashboard.
Update docker-compose README with additional info about services.
* Document all the functions supported by MetricsQL, including PromQL functions
* Group functions by their type: rollup functions, transform functions, label manipulation functions and aggregate functions.
* Document implicit query transformations.
Store the scraped response body instead of storing the parsed and relabeld metrics.
This should reduce memory usage, since the response body takes less memory than the parsed and relabeled metrics.
This is especially true for Kubernetes service discovery, which adds many long labels for all the scraped metrics.
This should also reduce CPU usage, since the marshaling of the parsed
and relabeld metrics has been substituted by response body copying.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1526
Before, metric `vm_request_duration_seconds` was update only on successful
attempts which could be misleading. For example, timeout errors on netstorage
request may be not accounted in the metric and won't be visible on dashboards.
Using `defer` statement to update the metric after query arguments validation
may improve the situation.
This option allows reducing CPU usage a bit when VictoriaMetrics is used
for collecting and processing non-Prometheus data. For example, InfluxDB line protocol, Graphite, OpenTSDB, CSV, etc.
This option can be useful when vmagent consumes too much additional memory
for staleness markers functionality and when staleness markers aren't needed.
* vmalert: allow to disable automatically added path to remote write address via disablePathAppend flag
* docs: update docs to include remoteWrite.disablePathAppend
This metric can be used for determining high saturation of every connection to remote storage with
an alerting query `rate(vmagent_remotewrite_send_duration_seconds_total) > 0.9s`.
This query triggers when a connection is satureated by more than 90%
This reverts commit 94dfcb6747a3b29a11d14e71bea21a2312bb6346.
It is better to remove staleness marks (decimal.StaleNaN) before calling rollupConfig.Do, e.g. in preFunc
Prometheus stalenss marks shouldn't be changed in removeCounterResets. Otherwise they will be converted to an ordinary NaN values,
which couldn't be removed in dropStaleNaNs() function later. This may result in incorrect calculations for rollup functions.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1526
* docs: update "number of open files" tuning recomendation
Make "number of open files" recomendation not only Prometheus specific to avoid
confusion for users who does not use Prometheus.
* docs: mention fstrim in Tuning section
* vmalert: expose new metrics for tracking number of produced samples during last evaluation
Two new metrics were added to track the number of samples produced during the last evaluation:
* vmalert_recording_rules_last_evaluation_samples
* vmalert_alerting_rules_last_evaluation_samples
The gauge type is used to remain consistent with Prometheus metric
`prometheus_rule_group_last_evaluation_samples` which is on the group level.
However, the counter type was considered as well.
Two metrics instead of one are used to make it easier to separate recording and
alerting rules. It is likely, number of samples produced by recording rules is
more important so people will refer to it more frequently.
The expected usage of the new metric is the following:
```
- alert: RecordingRuleReturnsEmptyResults
expr: sum(vmalert_recording_rules_last_evaluation_samples) by(recording) < 1
annotations:
summary: Recording rule {{$labels.recording}} returns empty results.
Please verify expression correctness.
```
Addresses https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1494
* vmalert: rename `vmalert_alerts_error` to `vmalert_alerting_rules_error` to remain consistent with recording rules metrics
* feature: Add multitenant for vmagent
* Minor fix
* Fix rcs index out of range
* Minor fix
* Fix multi Init
* Fix multi Init
* Fix multi Init
* Add default multi
* Adjust naming
* Add TenantInserted metrics
* Add TenantInserted metrics
* fix: remove unused metrics for vmagent
* fix: remove unused metrics for vmagent
Co-authored-by: mghader <marc.ghader@ubisoft.com>
Co-authored-by: Sebastian YEPES <syepes@gmail.com>
* Rename -search.maxMetricsPointSearch to -search.maxSamplesPerQuery, so it is more consistent with the existing -search.maxSamplesPerSeries
* Move the -search.maxSamplesPerQuery from vmstorage to vmselect, so it could effectively limit the number of raw samples obtained from all the vmstorage nodes
* Document the -search.maxSamplesPerQuery in docs/CHANGELOG.md
* fix: move request button to server input
* feat: add switch for query autocomplete
* refactor: rename state for popover open
* feat: add detect os by userAgent
* fix: change hotkey to run query for mac
* fix: change detect mac os
* fix: change div to span inside Typography
Co-authored-by: yury <yurymolodov@victoriametrics.com>
This should improve the readability and usefullness of the /api/v1/status/top_queries when debugging slow queries
or queries that take too much cpu time.
Previously the switch occurred when the cache size becomes 100% of its capacity. The cache size could never reach 100% capacity.
This could prevent from switching from the split cache to full cache, thus reducing the cache effectiveness.
- Support durations anywhere in MetricsQL queries. E.g. sum_over_time(m[1h])/1h is equivalent to sum_over_time(m[1h])/3600
- Support durations without suffix. E.g. rate(m[300]) is equivalent to rate(m[5m])
Previously needsDedup() could return true if the de-duplication wasn't needed for the following case:
d < interval
/ \
| v | v |
interval interval
Now it properly returns false for this case
This reverts commit 7c6d3981bf.
Reason for revert: high contention at bucket16Pool on systems with big number of CPU cores.
This slows down query processing significantly.
This should reduce memory usage on systems with big number of CPU cores,
since every inmemoryPart object occupies at least 64KB of memory and sync.Pool maintains
a separate pool inmemoryPart objects per each CPU core.
Though the new scheme for the pool worsens per-cpu cache locality, this should be amortized
by big sizes of inmemoryPart objects.
CPU and memory profiles show that the pool capacity for inmemoryBlock objects is too small.
This results in the increased load on memory allocation code in Go runtime.
Increase the pool capacity in order to reduce the load on Go runtime.
This should improve hit ratio for tagFiltersCache when big number of new time series are constantly registered
(aka high churn rate). This, in turn, should reduce CPU usage for queries over such time series.
Previously the stats for cache misses could be improperly counted, because it had inflated cache misses
if the entry was missing in the curr cache, but was existing in the prev cache.
The same applies to cache requests - they were inflated if the entry was missing in the curr cache.
To add a Copy button wrap code snippet with the following element:
```
<div class="with-copy" markdown="1">
<your-code-snippet>
</div>
```
See the changes to `Kubernetes monitoring with VictoriaMetrics Single` for details.
Previously the switch from `split` to `whole` mode had been performed too early,
e.g. when the current cache size became bigger than 1/4 of the allowed cache size.
Now it is performed when the current cache size becomes bigger than 1/2 of the allowed cache size.
This change can reduce memory usage for data ingestion path when big number of active time series are ingested.
One minute cache timeout result in slower queries in some production workloads where the interval
between query execution is in the range 1 minute - 2 minutes.
This should reduce memory usage on a system with high number of active time series and a high churn rate.
One minute is enough for caching the blocks needed for repeated queries (e.g. alerting rules, recording rules and dashboard refreshes).
This should reduce memory usage for the pool on systems with big number of CPU cores.
The sync.Pool maintains per-CPU pools, so the total number of objects in the pool
is proportional to the number of available CPU cores. The channel limits the number
of pooled objects by its own capacity. This means smaller number of pooled objects on average.
This should reduce resource usage (CPU, RAM, disk IO) at vmstorage nodes
if the addresses of vmstorage nodes are passed in random order to vminsert nodes.
* Change default value of '-remoteWrite.queues' to cgroup.AvailableCPUS() * 2 to reduce scrape interval
Default value of vmagent option '-remotewrite.queues' is 4 and default
size of vmagent ScheudleUnmarshalWorkers is number of CPUs, when available
CPUs is much greater than 4, e.g 32, worker are competing push queues
which will increase scrape interval and may cause scrape timeout.
* Update README and flag description
Co-authored-by: xiaozy <xiaozy01@fenbi.com>
Due to staleness handling, increase_pure were using incorrect previous value
during calculation in cases where series disappears for period longer
than staleness period and then returns back. The fix suppose to account
for a real datapoint value before staleness takes place. The fix should
remove unexpected spikes while using `increase_pure` for staled series.
* dashboard: update single version dash
The update contains the following changes:
* display anonymous memory usage metric. This metric suppose to reflect
memory usage of the process which can't be freed by OS;
* add legends to all panels. This is important for cases when users share
the screenshots;
* modify panels for Grafana v8.0.0
* dashboard: update single version dash tags
* dashboard: update vmagent dash
The update contains the following changes:
* display anonymous memory usage metric. This metric suppose to reflect
memory usage of the process which can't be freed by OS;
* add legends to all panels. This is important for cases when users share
the screenshots;
* modify panels for Grafana v8.0.0
This panic can be raised by the reverseProxy on aborted request to the backend.
So handle it (e.g. suppress) at reverseProxy.ServeHTTP call.
Do not suppress the panic at lib/httpserver generic HTTP handler,
since it may result in an inconsistent state left after the panicking handler.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1353
* vmalert: fix mistake with object reuse while parsing response
During the refactoring, the wrong optimisations was applied in
parse function which caused metric fields reset. The change removes
optimisation.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1369
* vmalert: add test to cover multiple metrics in one response
* vmalert: support rules backfilling (aka `replay`)
vmalert can `replay` configured rules in the past
and backfill results via remote write protocol.
It supports MetricsQL/PromQL storage as data source,
and can backfill data to remote write compatible
storage.
Supports recording and alerting rules `replay`. See more
details in README.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/836
* vmalert: review fixes
* vmalert: readme fixes
* new feature: relabel logging
Use scrape_configs[x].relabel_debug = true to log metric names inkl.
labels before and after relabeling. After relabeling related metrics
get dropped, i.e. not submitted to servers.
* vminsert wants relabel logging, too.
The pool for inmemoryBlock struct doesn't give any performance gains in production workloads,
while it may result in excess memory usage for inmemoryBlock structs inside the pool during
background merge of indexdb.
New flag `-rule.configCheckInterval` defines how often `vmalert` will re-read
config file. If it detects any changes, the config will be reloaded.
This behaviour is turned off by default.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/512
This speeds up bucket32.addBucketAtPos() when bucket32.buckets contains big number of items,
since the copying of bucket16 pointers is much faster than the copying of bucket16 objects.
This is a cpu profile for copying bucket16 objects:
10ms 13.43s (flat, cum) 32.01% of Total
10ms 120ms 650: b.b16his = append(b.b16his[:pos+1], b.b16his[pos:]...)
. . 651: b.b16his[pos] = hi
. 13.31s 652: b.buckets = append(b.buckets[:pos+1], b.buckets[pos:]...)
. . 653: b16 := &b.buckets[pos]
. . 654: *b16 = bucket16{}
. . 655: return b16
. . 656:}
This is a cpu profile for copying pointers to bucket16:
10ms 1.14s (flat, cum) 2.19% of Total
. 100ms 647: b.b16his = append(b.b16his[:pos+1], b.b16his[pos:]...)
. . 648: b.b16his[pos] = hi
10ms 700ms 649: b.buckets = append(b.buckets[:pos+1], b.buckets[pos:]...)
. 330ms 650: b16 := &bucket16{}
. . 651: b.buckets[pos] = b16
. . 652: return b16
. . 653:}
The new setting `extra_filter_labels` may be assigned to group.
If it is, then all rules within a group will automatically filter
for configured labels. The feature is well-described here
https://docs.victoriametrics.com#prometheus-querying-api-enhancements
New setting is compatible only with VM datasource.
Previously the blocked directories were removed sequentially by a single goroutine.
This can be not enough for highly loaded VictoriaMetrics that accepts millions of sample per second,
when big number of LSM parts are created and removed at high rate.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1313
* changes vmalert query function
for prometheus rules compatibility its better to use labels as map.
it simplifies template evaluation and allow to ignore can't evaluate field error
because map will return default value.
fixes https://github.com/VictoriaMetrics/operator/issues/243
These numbers are exposed via the following metrics:
- vmagent_hourly_series_limit_current_series
- vmagent_daily_series_limit_current_series
Expose also the limits via the following metrics:
- vmagent_hourly_series_limit_max_series
- vmagent_daily_series_limit_max_series
The `::tag` type is needed in cases when field and tag names are equal, which
results into unexpected results in InfluxQL. Setting the type explicitly helps
InfluxDB to understand which exact column we apply filter to.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1299
duplicates map helps to determine wheter extra labels has overriden
labels which make time series unique. It was using a sorted hashed
labels sequence as a key. But hashing algorithm could have collisions,
so it is more convenient to not use hashing at all.
Log message for recording rules duplicates was improved as well.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1293
These functions are non-trivial, while their code has minimal differences.
It is better from maintainability PoV to merge these functions into a single function.
It must match all the time series on the given time range.
Previously it was matched to all the time series without the restriction on the given time range.
Dependency updates must be under manual control, since the resulting code diffs must be reviewed manually for the sake of security.
It is done with `make vendor-update` now.
* Add vendor license checker, update codecov action, add dependbot for github actions
* update gitingore, temprorary turn on check
* fix action name
* change action rules to trigger only when vendor changes
* remove obsolete line from main action
Starting from v1.56.0 VM supports `round_digits` which allows to limit
the number of digits after the decimal point in response value. The feature
can be used to reduce entropy of produced by recording rules values
and significantly improve the compression. See more details in link below.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/525
Previously, `startGroup` could exit on restore errors despite the
`remoteRead.ignoreRestoreErrors` flag value. Now vmalert checks the
flag value before deciding whether to return error or just log it.
This should eliminate possible race when an update on endpoints depends on pods and/or services, which are missing in the cache yet.
This could result in missing targets based on endpoints or endpointslices.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1240
Alerting rules now can return specific error type ErrStateRestore to indicate
whether restore state procedure failed. Such errors were returned and logged
before as well. But now user can specify whether to just log these errors
(remoteRead.ignoreRestoreErrors=true) or to stop the process
(remoteRead.ignoreRestoreErrors=false). The latter is important when VM isn't
ready yet to serve queries from vmalert and it needs to wait.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1252
Panics may leave the process in inconsistent state. That's why it is better to stop the process after the panic
instead of recovering from the panic. Unfortunately, the standard net/http.Server recovers panics in request handlers.
See https://github.com/golang/go/issues/16542 . That's lib/httpserver must stop the process on itself after the panic.
* Simplify arguments list for fn `queryDataSource` to improve readbility
* vmalert: adjust `time` param according to rule evaluation interval
With this change, vmalert will start to use rule's evaluation interval
for truncating the `time` param. This is mostly needed to produce consistent
time series with timestamps unaffected by vmalert start time. Now, timestamp
becomes predictable.
Additionally, adjustment is similar to what Grafana does for plotting range graphs.
Hence, recording rule series and recording rule expression plotted in grafana
suppose to become similar in most of cases.
* changes vmalert Querier with per rule querier
it allows to changes some parametrs based on rule setting
for instance - alert type, tenant for cluster version or event endpoint url.
Previously, vmalert used `lastExecTime` timestamp when writing recording rules
to the remote storage. This may be incorrect, if vmalert uses `datasource.lookback` flag,
which means rule's expression will be executed at some moment in the past.
To avoid such situations, vmalert now will use returned timestamp instead of `lastExecTime`.
This should increase block sizes and subsequently increase the maximum possible bandwidth per each connection to remote storage.
This, in turn, should reduce the probability of storing the data in local buffers.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1235
This partially reverts fb82c4b9fa
It has been appeared that the additional memory allocation may result in higher GC pauses.
It is better to spend CPU time on copying bigger bucket16 structs instead of increasing query latencies due to higher GC pauses
* [draft] per tenant statistic
* updates metric name
update graph
adds link and example config
* quick fix
* adds grafana dashboard
adds example alert
Co-authored-by: f41gh7 <nik@victoriametrics.com>
Just like the existing infrastructure for `BUILDINFO_TAG`, this can ease the production of [reproducible builds](https://wiki.freebsd.org/ReproducibleBuilds).
(e.g. in FreeBSD the date the port was committed is used at build time, not the actual build time, so that an identical port produced at different times produces an identical executable)
* docs: drop table of contents for `vmctl`
We already have it autogenerated on .github.io, so no need to keep it.
* docs: mention OpenTSDB migration feature for vmctl
* docs: sync docs for `vmalert`
* add more documentation on OpenTSDB migration explaining what chunking means
* more clarification of OpenTSDB aggregations
* break out what a retention string becomes
* add more docs around retention strings
* add example of running program and fix mistake in how hard offsets are handled
* fix formatting
The major change is adding `sort` directive to docs. For those docs which are copied
from internal packages `sort` is added via makefile command. For the rest it is added
manually since they're updated manually as well.
The rest of changes is connected with markdown formatting. For example, changing headers
in some files (`##` => `#`) makes navigation on .github.io to look better. This especially
useful for `changelog` docs.
Table of contents for `vmctl` is dropped, since we already have it autogenerated on .github.io.
No link changes expected. The corresponding PR to `cluster` branch will be made in follow-up PR.
It should be faster querying all the labels and/or all the values instead of querying per-day labels/values on time ranges exceeding maxDaysForPerDaySearch
Remove async registration of apiWatchers, since it breaks discovering `role: endpoints` and `role: endpointslices` targets,
which depend on pod and service objects.
There is no need in reloading `endpoints` and `endpointslices` targets if the referenced `pod` or `service` objects change,
since in this case the corresponding `endpoints` and `endpointslices` objects should also change because they contain
ResourceVersion of the referenced `pod` or `service` objects, which is modified on object update.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1182
This option can be useful when samples for the same time series are ingested with distinct order of labels.
For example, metric{k1="v1",k2="v2"} and metric{k2="v2",k1="v1"}.
Alert `TooHighChurnRate24h` suppose to cover cases when churn rate
is low but results in multiple times higher number than total
number of active series.
* dashboard: update single node dashboard
* add number of new series created over last 24h;
* bump version requirements.
* dashboard: update vmagent dashboard
* add panel for open file descriptors;
* add panel for disk I/O;
* add panel for `vmagent_remotewrite_packets_dropped_total` metric;
* bump version requirements.
* adds blocks drop at 400 BadRequest status code
recieved from remote storage,
not expected that remote storage will be able to handle it on retry
* removes error logging for dropped blocks,
its expected error
3s evaluation interval is too small for practical setups. It can result in increased load on datasource.
So it is better to remove it from example config args, which are usually copy-pasted by novice users.
This adds the following new metrics for each VictoriaMetrics app:
* process_resident_memory_anonymous_bytes - the RSS share for memory allocated by the process itself.
This share cannot be freed by the OS, so it must be taken into account by OOM killer.
* process_resident_memory_pagecache_bytes - the RSS share for page cache memory (aka memory-mapped files).
This share can be freed by the OS at any time, so it must be ignored by OOM killer.
There was a signficant refactoring in the code responsible for time series search,
so it can result in both speed ups and slow downs depending on used queries.
The archive contains the following executables for Windows:
* vmagent
* vmalert
* vmauth
* vmctl
Other components - vmbackup, vmrestore, victoria-metrics - aren't supported for Windows yet
Do not pass filter metric ids to getMetricIDsForTagFilter, since it has been appeared that this slows down
the function by multiple times when it finds big number of metricIDs (tens of millions).
* dashboard: update single node dashboard
* add panel `Open FDs` for file descriptors metrics;
* add panel `Disk writes/reads` to show the real read/write
load on storage layer;
* add `process_resident_memory_bytes` metric to memory usage panel;
* add stats panel to show available CPUs, memory and disk space;
* rm flags panel since it didn't prove its usefulness.
* alerts: add alert for reaching FDs limit
Do not cache too big byte buffers and too big writeRequestCtx objects,
since it is cheaper to re-create them instead of wasting RAM for their caching.
This reverts 7f6f350ee1
Previously multiple scrape jobs could create multiple watchers for the same apiURL. Now only a single watcher is used.
This should reduce load on Kubernetes API server when many scrape job configs use Kubernetes service discovery.
See the description for `sample_limit` option from Prometheus docs:
Per-scrape limit on number of scraped samples that will be accepted.
If more than this number of samples are present after metric relabeling
the entire scrape will be treated as failed. 0 means no limit.
https://prometheus.io/docs/prometheus/latest/configuration/configuration/#scrape_config
Serialize reloading per-role objects, so they don't occupy too much memory when objects for many scrape jobs are simultaneously refreshed.
Do not reload per-role objects if they were already refreshed by concurrent goroutines. This should reduce load on Kubernetes API server
when big number of scrape jobs are configured for the same Kubernetes role.
This is a follow-up for 17b87725ed
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1113
Previously vmagent was creating a separate Kubernetes object cache per each scrape job.
This could result in increased memory usage when monitoring a Kubernetes cluster with big number of objects (pods / nodes / services, etc.)
as seen at https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1113
Now it uses a shared map of scrape objects across multiple scrape jobs.
* fixes windows compilation,
adds signal impl for windows,
adds free space usage for windows,
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/70https://github.com/VictoriaMetrics/VictoriaMetrics/issues/1036
NOTE victoria metrics database still CANNOT work under windows system,
only vmagent is supported.
To completly port victoria metrics, you have to fix issues with separators,
parsing and posix file removall
* rollback separator
* Adds windows setInformation api,
it must behave like unix, need to test it.
changes procutil
* check for invlaid param
* Fixes posix delete semantic
* refactored a bit
* fixes openbsd build
* removed windows api call
* Fixes code after windows add
* Update lib/procutil/signal_windows.go
Co-authored-by: Aliaksandr Valialkin <valyala@gmail.com>
The first and the last buckets are usually `[0 ... leMin]` and `(leMax ... +Inf)`. If they are merged with adjancent buckets,
then the resulting accuracy can suffer.
Main points:
* Revert changes outside lib/promscrape/discovery/kuberntes . These changes can be applied later in a separate commit
* Minimize changes in lib/promscrape/discovery/kubernetes compared to a93e644001
* Corner case fixes.
* started work on sd for k8s
* continue work on watch sd
* fixes
* continue work
* continue work on sd k8s
* disable gzip
* fixes typos
* log errror
* minor fix
Co-authored-by: Aliaksandr Valialkin <valyala@gmail.com>
The v1.15.0 exports the following additional metrics:
process_io_read_bytes_total - the number of bytes read via io syscalls such as read and pread
process_io_written_bytes_total - the number of bytes written via io syscalls such as write and pwrite
process_io_read_syscalls_total - the number of read syscalls such as read and pread
process_io_write_syscalls_total - the number of write syscalls such as write and pwrite
process_io_storage_read_bytes_total - the number of bytes read from storage layer
process_io_storage_written_bytes_total - the number of bytes written to storage layer
These metrics can be used for monitoring process io
While this may increase CPU and disk IO usage needed for background merge,
this also recudes CPU usage during queries in production. This is because
such queries tend to read recently added data and it is better to have lower number
of parts for such data in order to reduce CPU usage.
This partially reverts ebf8da3730
The filter arg has been removed in the commit c7ee2fabb8
because it was preventing from caching the number of matching time series per each tf.
Now the cache contains duration for tf execution, so the filter shouldn't break such caching.
For example `{label=~"foo\.bar"}` should be converted to `{label="foo.bar"}`. Previously it has was mistakenly conveted to `{label="foo\.bar"}` .
This could result in missing time series for such tag filters.
These metrics may result in big number of time series when vmagent scrapes thousands of targets and these targets constantly changes.
* It is better using `up == 0` query for determining failing targets.
* It is better using the following query for determining targets with exceeded limit on the number of metrics:
scrape_samples_scraped > 0 if up == 0
The `filter` arg breaks the logic for sorting tag filters by the matching metrics,
which may result in non-optimal performance during time series search.
Production workloads show that indexdb blocks must be cached unconditionally for reducing CPU usage.
This shouldn't increase memory usage too much, since unused blocks are removed from the cache every two minutes.
* init implementation for graphite alerts
* adds graphite support for vmalert
* small fix
* changes vmalert graphite api with type
* updates tests
* small fix
* fixes graphite parse
* Fixes graphite from time
It is better developing vmctl tool in VictoriaMetrics repository, so it could be released
together with the rest of vmutils tools such as vmalert, vmagent, vmbackup, vmrestore and vmauth.
These metrics could be useful for determining imporperly working scrape targets.
Note that these metrics are exported only for failing scrape targets. They aren't exposed for normally working targets.
* Include individual binary checksums for vmutils
* Consistent archive/binary artefacts between arm64/amd64 for vmutils
* architecture in arhcive, checksums
* not in binaries
* adds extra_label to all import apis,
changes priority for extra_label - now it has priority over original labels
* Update README.md
Co-authored-by: Aliaksandr Valialkin <valyala@gmail.com>
* Update README.md
Co-authored-by: Aliaksandr Valialkin <valyala@gmail.com>
* adds extra labels to vmagent import api
changes order for adding labels, now its added after user values
* adds tests for extra_label
* import fix
Co-authored-by: Aliaksandr Valialkin <valyala@gmail.com>
List contains examples for the alerting rules which might be executed
via `vmalert` to track the health state of VM components. It is assumed
that list will be revised and calibrated for each system individually.
On templates validation stage vmalert does not acutally send queries, so for complex
chained expression validation may fail. To avoid this, we add a blank sample in response
so validation can pass successfully. Later, during the rule execution, stub will be replaced
with real `query` function.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/989
Currently, alertmanager spams logs with `Notify attempt failed, will retry later` message
because default receiver is unreachable. The change updates default configuration with
blackhole receiver which means alertmanager will continue to accept alerts but won't make
attempts to send them anywhere.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/995
* adds snap docs,
adds release information for snap package,
adds docs notes about configuration management with snap package.
* adds release page mention
* version fix for snap, its awful
* revert version
* Apply suggestions from code review
Co-authored-by: Aliaksandr Valialkin <valyala@gmail.com>
Previously it could be modified in order to improve response cache hit ratio.
This is unneeded, since cache hit ratio should remain good because the query time range
should be already aligned to multiple of `step` values.
See https://github.com/VictoriaMetrics/VictoriaMetrics/issues/976
* Adds query stat handler,
for query and query_range api, victoriametrics tracks query execution time,
stats are expored at /api/v1/status/queries endpoint with topN param
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/907
* fixed query stats bugs
* improves queryStats tracker
* improves query stat
* small fix
* fix tests
* added more tests
* fixes 386 tests
* naming fixes
* adds drop for outdated records
* adds proxy_url support,
adds proxy_url to the dockerswarm, eureka, kubernetes and consul service discovery,
adds proxy_url to the scrape_config for targets scrapping,
http based proxy is supported atm,
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/503
* fixes imports
Previously such parts could remain undeleted for long durations until they are merged with other parts.
This should help for `-retentionPeriod` values smaller than one month.
It is unclear why `go install` doesn't work in Github Actions. Needs additional investigation.
The following error is returned now:
cannot find package "golang.org/x/lint/golint" in any of:
/opt/hostedtoolcache/go/1.15.5/x64/src/golang.org/x/lint/golint (from $GOROOT)
/home/runner/go/src/golang.org/x/lint/golint (from $GOPATH)
* adds ArrayDuration and ArrayBool flags,
makes sendTimeout and tlsInsecure configurable per remoteWrite url
* added backward compatibility testcases for ArrayDuration and ArrayBool
* fixes bool flag
* fixes test cases
The commit adds a support for template function `query`,
`first` and `value`. The function `query` executes
a MetricsQL query for active alerts. In vmalert we
update templates on every evaluation for active alerts
to keep them up to date. With `query` func it may become
a perf issue since it will fire a query on every execution.
We should keep it in mind for now.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/539
This is a preparation for Go 1.16, which deprecates `go get` for installing binaries.
See https://tip.golang.org/doc/go1.16#go-command :
go install, with or without a version suffix (as described above), is now the recommended way
to build and install packages in module mode. go get should be used with the -d flag to adjust
the current module's dependencies without building packages, and use of go get to build and install
packages is deprecated. In a future release, the -d flag will always be enabled.
This diff is just to suggest wording to let people know there is no future-compatible guaranteed way to make their own native format files for import yet.
This should prevent from possible 'memory leaks' when a pointer to ScrapeWork item stored in the slice
could prevent from releasing memory occupied by all the ScrapeWork items stored in the slice when they
are no longer used.
See the related commit e205975716 and the related issue https://github.com/VictoriaMetrics/VictoriaMetrics/issues/825
* adds consul watch api,
it must reduce load on consul service with blocking wait requests,
changed discoveryClient api with fetchResponseMeta callback.
* small fix
* fix after master merge
* adds watch client at discovery utils
* fixes consul watcher,
changes namings,
fixes data race
* small typo fix
* sanity fix
* fix naming and service node update
OpenMetrics timestamps are floating-point numbers, that represent Unix timestamp in seconds.
This differs from Prometheus exposition format, where timestamps are integer numbers representing Unix timestamp in milliseconds.
All the callers for fs.OpenReaderAt expect that the file will be opened.
So it is better to log fatal error inside fs.MustOpenReaderAt instead of leaving this to the caller.
It is recommended to have at least of 50% of free RAM on vmstorage nodes in order handle possible
RAM usage spikes during rolling upgrade for vmstorage nodes when time series
are re-routed from temporarily unavailable node to the remaining active nodes.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/891
This commit also adds `"isPartial":{true|false}` field to `/api/v1/*` responses. `"isPartial":true` is set when the response
is based on a partial data because some of vmstorage nodes weren't available during query processing.
* Add omitempty for DisableCompression and DisableKeepAlive fields in ScrapeConfig
* Add omitempty annotation to all the default/optional values
* Fix annotations after review
This should prevent from holding previously discovered []ScrapeWork slices when a part of discovered targets changes over time.
This should reduce memory usage for the case when big number of discovered scrape targets changes over time.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/825
The previous implementation treated extra labels (global and rule labels) as
separate label set to returned time series labels. Hence, time series always contained
only original labels and alert ID was generated from sorted labels key-values.
Extra labels didn't affect the generated ID and were applied on the following actions:
- templating for Summary and Annotations;
- persisting state via remote write;
- restoring state via remote read.
Such behaviour caused difficulties on restore procedure because extra labels had to be dropped
before checking the alert ID, but that not always worked. Consider the case when expression
returns the following time series `up{job="foo"}` and rule has extra label `job=bar`.
This would mean that restored alert ID will be always different to the real time series because
of collision.
To solve the situation extra labels are now always applied beforehand and `vmalert` doesn't
store original labels anymore. However, this could result into a new error situation.
Consider the case when expression returns two time series `up{job="foo"}` and `up{job="baz"}`,
while rule has extra label `job=bar`. In such case, applying extra labels will result into
two identical time series and `vmalert` will return error:
`result contains metrics with the same labelset after applying rule labels`
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/870
* adds leading forward slash check for scrapeURL path
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/835
* adds ready probe for scrape config initialization,
it should prevent metrics loss during vmagent rolling update,
/ready api will return 425 http code, if some scrape config still waits for initialization.
* updates docs
* Update app/vmagent/README.md
* renames var
* Update app/vmagent/README.md
Co-authored-by: Aliaksandr Valialkin <valyala@gmail.com>
Previously such blocks were cleaned after they weren't accessed during 10 minutes.
Now they are cleaned after one minute of missing access. This should reduce memory usage in general case.
* add short_version label to vm_app_version metric
use case: Version panel of Grafana dashboard should use a live query, but currently it uses a template query which becomes stale. Grafana is not able to preform regex substitution on labels.
* Update metrics.go
* fix compile
* dashboard: add `Storage full ETA` panel
The new panel suppose to help to estimate the time needed to run out of free
disk space.
Thx to @belm0 @hekmon
* disable legend for `Storage full ETA` panel
Label `alertgroup` was introduced in #611 and automatically added to generated
time series. By mistake, this new label wasn't correctly purged on restore event
and affected alert's ID uniqueness. This commit removes `alertgroup` label
in restore function.
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/870
The new release of github.com/VictoriaMetrics/metricsql adds more optimizations for `foo{filters1} op bar{filters2}`:
* rollup_func(foo[d]) op bar{filters}
* transform_func(foo) op bar{filters}
* num_or_scalar op bar op baz{filters}
This should prevent from double counting for time series at the time when it changes label.
The most common case is in K8S, which changes pod uid label with each new deployment.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/748
This reverts commit ac45082216.
Reason for revert: the previous behavior for VictoriaMetrics is easier to understand and use by users -
functions, which don't change the meaning of the time series shouldn't drop metric name.
Now the following functions do not drop metric names:
* ceil
* floor
* round
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/674
This reverts commit bb61a4769b.
Reason for revert: the previous behavior for VictoriaMetrics is easier to understand and use by users -
functions, which don't change the meaning of the time series shouldn't drop metric name.
Now the following functions do not drop metric name:
* clamp_min
* clamp_max
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/674
This reverts commit e5202a4eae.
Reason for revert: the previous behavior for VictoriaMetrics is easier to understand and use by users -
functions, which don't change the meaning of the time series shouldn't drop metric name.
Now the following functions do not drop metric name:
* max_over_time
* min_over_time
* avg_over_time
* quantile_over_time
* geomean_over_time
* mode_over_time
* holt_winters
* predict_linear
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/674
Show also original labels for duplicate targets in error message in order to simplify debugging the issue.
Now `/targets` endpoint accepts optional `show_original_labels=1` query arg, which shows original labels for each target.
This may simplify debugging for target relabeling.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/651
Remove the code that uses metricIDs caches for the current and the previous hour during metricIDs search,
since this code became unused after implementing per-day inverted index almost a year ago.
While at it, fix a bug, which could prevent from finding time series with names containing dots (aka Graphite-like names
such as `foo.bar.baz`).
Previously the maximum cache lifetime has been limited by 10 seconds. Now it is extended up to a day.
This should reduce CPU usage in the following cases:
* when querying recently added data with small churn rate for time series
* when querying historical data
This reverts commit bab6a15ae0.
Reason for revert: the `fetchData` arg is used in cluster branch.
Leaving this arg in master branch makes smaller the diff with cluster branch.
* Add improvements to ec2 discovery
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/771
role_arn support with aws sts
instance iam_role support
refreshing temporary tokens
* Apply suggestions from code review
Co-authored-by: Roman Khavronenko <hagen1778@gmail.com>
* changed implementation, removed tests, clean up code
* moved endpoint builder into getEC2APIResponse
Co-authored-by: Roman Khavronenko <hagen1778@gmail.com>
The strategy is:
- Periodical flushing of inmemory blocks to files, so they aren't lost on unclean shutdown.
- Periodical syncing of metadata for persisted queues, so the metadata remains in sync with the persisted data.
- Automatic adjusting of too big chunk size when opening the queue. The chunk size may be bigger than the writer offset after unclean shutdown.
- Skipping of broken chunk file if it cannot be read.
- Fsyncing finalized chunk files.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/687
Previously certain errors in timestamps and/or values could be silently skipped,
which could lead to samples with zero values stored in the database.
Updates https://github.com/VictoriaMetrics/vmctl/issues/25
On config reload event `vmalert` reloads configuration for every group. While
it works for simple configurations, the more complex and heavy installations may
suffer from frequent config reloads.
The change introduces the `checksum` field for every group and is set to md5 hash
of yaml configuration. The checksum will change if on any change to group
definition like rules order or annotation change. Comparing the `checksum` field
on config reload event helps to detect if group should be updated.
The groups update is now done concurrently, so reload duration will be limited by
the slowest group now.
Partially solves #691 by improving config reload speed.
This is similar to https://github.com/prometheus/prometheus/pull/6838 , which will be added in Prometheus v2.21.
See https://github.com/prometheus/prometheus/releases/tag/v2.21.0-rc.1
* Added endpointslices discovery to k8s api
Started from 1.17 k8s version endpointslices is beta,
it allows to query k8s api for endpoints more efficient.
It presents at scrape_config.yaml as separate role for kubernetes_sd_config.
kubernetes_sd_config:
- role: endpointslices
* fixed typos, changed EndpointConditions signature - with values instead of pointers
This should reduce disk space usage when scraping targets containing metrics with identical names
such as `node_cpu_seconds_total`, histograms, quantiles, etc.
Expose `vm_timestamps_blocks_merged_total` and `vm_timestamps_bytes_saved_total` metrics for monitoring
the effectiveness of timestamp blocks merging.
* revise /api/v1/series docs
Further clarification for #735
* clarify how default range differers from Prometheus API
* avoid `start=0` suggestion when confirming delete, because it will cause a timeout in most deployments
* Update README.md
aws sdk has complicated logic for chosing profile name and we shouldn't set
it to `default` value. It leads to bugs and improper configuration.
Set it to empty value by default is safe. It will be automatically set to `default` by sdk.
These functions returns the number of raw samples that don't exceed `le` or are bigger than `gt`.
These functions are complement to already existing `share_le_over_time(m[d], le)` and `share_gt_over_time(m[d], gt)`.
* VMAlert start with empty rules dir
There are some applications (operator for instance), that generates alerts configuration at runtime
and vmalert must start correctly without rules to support this behaviour.
Later application will add rules files and send SIGHUP to vmalert,
which will trigger reading rules files and start rules exectuion.
Removing rules files with SIGHUP signal must stop rules execution and
vmalert will wait for new rules.
* imports sorted
* added test cases for empty rules, removed blank line
* fixed imports conflict
* updated tests
* feat: spread load of rule evaluation by group when starting new groups
* review: reduce the resulting diff.
* Update app/vmalert/group.go
Co-authored-by: Roman Khavronenko <hagen1778@gmail.com>
Co-authored-by: Aliaksandr Valialkin <valyala@gmail.com>
Co-authored-by: Roman Khavronenko <hagen1778@gmail.com>
The previous basis on `cap(sw.labels)` doesn't work anymore after 7785869ccc ,
because `sw.labels` may be reset multiple times when processing big number of rows.
* dashboard: rename var `datasource` to `ds` for consistency reason
Dasbhoards for cluster version or vmagent operate with datasource variable
named `ds`. For consistency sake we rename this variable in single node version
as well.
* dashboard: add instance variable picker
See dashboard reviews here https://grafana.com/grafana/dashboards/10229/reviews
* dashboard: limit number of buckets in histogram to 12 for vmagent dashboard
* dashboard: bump version requirement in description for single version
* dashboard: drop extra series override for single version
* dashboard: set Y-min to zero for most of panels in vmagent dashboard
Description for `-rule` flag uses as example specific chars like asterisks
which could be interpreted wrong by different shells. To avoid this, description
now contains quoted flag values.
See also #708
Arch 386 is a 32-bit architecture and interprets int type for numbers as an explicit int32,
whereas on most modern CPUs int is implicitly an int64. This makes tests to fail with
`int overflow` error.
Previously targets with big number of metrics and/or labels could generated too big buffers,
which then could be re-used when scraping targets with small number of metrics.
This resulted in memory waste.
Now big buffers are used only for targets with big number of metrics / labels,
while small buffers are used for targets with small number of metrics / labels.
The previous notion was inconsistent with what `decimal.Round` does.
According to [wiki](https://en.wikipedia.org/wiki/Significant_figures) rounding
applied to all significant figures, not just decimal ones.
* app/vmalert: extend metrics set exported by `vmalert` #573
New metrics were added to improve observability:
+ vmalert_alerts_pending{alertname, group} - number of pending alerts per group
per alert;
+ vmalert_alerts_acitve{alertname, group} - number of active alerts per group
per alert;
+ vmalert_alerts_error{alertname, group} - is 1 if alertname ended up with error
during prev execution, is 0 if no errors happened;
+ vmalert_recording_rules_error{recording, group} - is 1 if recording rule
ended up with error during prev execution, is 0 if no errors happened;
* vmalert_iteration_total{group, file} - now contains group and file name labels.
This should improve control over specific groups;
* vmalert_iteration_duration_seconds{group, file} - now contains group and file name labels. This should improve control over specific groups;
Some collisions for alerts and recording rules are possible, because neither
group name nor alert/recording rule name are unique for compatibility reasons.
Commit contains list of TODOs for Unregistering metrics since groups and rules
are ephemeral and could be removed without application restart. In order to
unlock Unregistering feature corresponding PR was filed - https://github.com/VictoriaMetrics/metrics/pull/13
* app/vmalert: extend metrics set exported by `vmalert` #573
The changes are following:
* add an ID label to rules metrics, since `name` collisions within one group is
a common case - see the k8s example alerts;
* supports metrics unregistering on rule updates. Consider the case when one rule
was added or removed from the group, or the whole group was added or removed.
The change depends on https://github.com/VictoriaMetrics/metrics/pull/16
where race condition for Unregister method was fixed.
Previously the limit has been raised to GOMAXPROCS, but it has been appeared that this
increases query latencies since more CPUs are busy with merges.
While at it, substitute `*MergeConcurrencyLimitCh` channels with simple integer limits.
Previously the `vm_slow_row_inserts_total` metric may be incremented multiple times for different data points per a single time series,
while only a single increment is needed when inserting the first data point for this time series.
`external.label` flag supposed to help to distinguish alert or recording rules
source in situations when more than one `vmalert` runs for the same datasource
or AlertManager.
due to memory align the metaindexRow structure use 64-byte pre object.
this commit changes the order of field, make metaindexRow use 56-byte pre
object.
Signed-off-by: Sasasu <su@sasasu.me>
This function works with both Prometheus-style and VictoriaMetrics-style buckets.
The function removes buckets with the lowest values in order to reserve the highest precision.
The function is useful for building heatmaps in Grafana from too big number of buckets.
Previously the time spent on inverted index search could exceed the configured `-search.maxQueryDuration`.
This commit stops searching in inverted index on query timeout.
Previously the enabled relabeling with `-relabelConfig` command-line flag could result in missing labels
if a single Influx line protocol message contains multiple field values.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/638
This condition may occur after the following sequence of events:
1) A goroutine enters the loop body when len(addRowsConcurrencyCh) == cap(addRowsConcurrencyCh) inside Storage.searchTSIDs.
2) All the goroutines return from Storage.AddRows.
3) The goroutine from step 1 blocks on searchTSIDsCond.Wait() inside the loop body.
The goroutine remains blocked until the next call to Storage.AddRows, which calls searchTSIDsCond.Signal().
This may take indefinite time.
This should prvent from inflated `increase()` results for time series that start from big initial values.
Such cases may occur when a label value changes in a metric without counter reset.
`vmagent` Grafana dashboard suppose to provide basic observability over multiple
`vmagent` instances. Dashboard is saved in Grafana export format so it can be easily
imported. It was also integrated into docker-compose environment.
This is a follow-up commit after 12b16077c4 ,
which didn't reset the `tsidCache` in all the required places.
This could result in indefinite errors like:
missing metricName by metricID ...; this could be the case after unclean shutdown; deleting the metricID, so it could be re-created next time
Fix this by resetting the cache inside deleteMetricIDs function.
* add vm_protoparser_rows_read_total metrics to promscrape
move vm_protoparser_rows_read_total for promscrape to better place
move vm_protoparser_rows_read_total for promscrape to better place
* remove possibility of infinity loop at prometheus parser
Array type flag is now defined as `value` type in flag description when printed.
This change adds additional description to every Array type flag so it would be
clear what exact type is used:
```
-remoteWrite.urlRelabelConfig array
Optional path to relabel config for the corresponding -remoteWrite.url
Supports array of values separated by comma or specified via multiple flags.
```
Metric `vm_persistentqueue_bytes_pending` is a gauge that shows current amount
of bytes in persistentqueue flushed on disk as a difference between write and read
offsets. This metric is very similar to `vmagent_remotewrite_pending_data_bytes`
except of accounting for bytes in-memory.
The change to `vm_promscrape_targets` metric suppose to improve observability
for `vmagent` so it will be possible to track how many targets are up or down
for every specific scrape group:
```
vm_promscrape_targets{type="static_configs", status="down"} 1
vm_promscrape_targets{type="static_configs", status="up"} 2
```
Previously multiple goroutines could access remoteWriteCtx.tss concurrently, which could lead to data race
and improper relabeling. Now each goroutine has its own copy of tss during relabeling.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/599
Previously the concurrency has been limited to GOMAXPROCS*2. This had little sense,
since every call to Storage.AddRows is bound to CPU, so the maximum ingestion bandwidth
is achieved when the number of concurrent calls to Storage.AddRows is limited to the number of CPUs,
i.e. to GOMAXPROCS.
Heavy queries could result in the lack of CPU resources for processing the current data ingestion stream.
Prevent this by delaying queries' execution until free resources are available for data ingestion.
Expose `vm_search_delays_total` metric, which may be used in for alerting when there is no enough CPU resources
for data ingestion and/or for executing heavy queries.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/291
* app/vmalert: add retries to remotewrite
Remotewrite pkg now does limited number of retries if write request failed.
This suppose to make vmalert state persisting more reliable.
New metrics were added to remotewrite in order to track rows/bytes sent/dropped.
defaultFlushInterval was increased from 1s to 5s for sanity reasons.
* fix
* wip
* wip
* wip
* fix bits alignment bug for 32-bit systems
* fix mistakenly dropped field
The maximum deadline miss duration is reduced to 2x scrape_interval in the worst case.
By default it is limited to scrape_interval configured for the given scrape target.
* Fix Auto metrics relabeled errors
* Finalize auto-genenated Labels
* Fix Test Errors
* fix error logs when queue is full
Co-authored-by: xinyulong <xinyulong@kuaishou.com>
VictoriaMetrics returns `503 Service Unavailable` http error for requests with time ranges outside the configured retention
if `-denyQueriesOutsideRetention` command-line flag is set.
* app/vmalert: support multiple notifier urls (#584)
User now can set multiple notifier URLs in the same fashion
as for other vmutils (e.g. vmagent). The same is correct for
TLS setting for every configured URL. Alerts sending is done
in sequential way for respecting the specified URLs order.
* app/vmalert: add basicAuth support for notifier client (#585)
The change adds possibility to set basicAuth creds for notifier
client in the same fasion as for remote write/read and datasource.
vmagent replaces Prometheus to perform scrapes and writes
into VictoriaMetrics installation. Prometheus datasource was
dropped, but its config was reused to feed vmagent.
Change also contains simplification in dashboard propagation
to Grafana container by removing excessive json manipulation
steps.
The change adds no new functionality and aims to move flags definitions
to subpackages that are using them. This should improve readability
of the main function.
As per documentation on `label_replace` function: "If the regular
expression doesn't match then the timeseries is returned unchanged".
Currently this behavior is not enforced, if a regexp on an existing
tag doesn't match then the tag value is copied as-is in the destination
tag. This fix first checks that the regular expression matches the
source tag before applying anything.
Given the current implementation, this fix also changes the behavior
of the **MetricsQL** `label_transform` function which does not
document this behavior at the moment.
Previously VictoriaMetrics was processing up to 32 time series in a single goroutine.
This could be slow if each time series contains big number of data points (10M+ or more), since only a single CPU core could be loaded with work,
while other CPU cores were idle. Fix this by launching GOMAXPROCS workers for time series processing.
This should help with https://github.com/VictoriaMetrics/VictoriaMetrics/issues/572
app/vmalert: add support for TLS configuration
Add support for TLS optional configuration in a similar fashion to what
is currently supported in other vmutils such as vmagent. TLS
configuration options are distinct for datasource, remoteRead,
remoteWrite as well as notifier.
These actions may be useful for filtering out unneeded targets and/or metrics if they contain equal label values.
For example, the following rule would leave the target only if __meta_kubernetes_annotation_prometheus_io_port
equals __meta_kubernetes_pod_container_port_number:
- action: keep_if_equal
source_labels: [__meta_kubernetes_annotation_prometheus_io_port, __meta_kubernetes_pod_container_port_number]
app/vmalert: Support custom URL for alerts source
Add flag `external.alert.source` for configuring custom URL
for alert's source. This may be handy to re-point default source
URL to other systems like Grafana.
Updates #517
Uniqueness of rule is now defined by combination of its name, expression and
labels. The hash of the combination is now used as rule ID and identifies rule within the group.
Set of rules from coreos/kube-prometheus was added for testing purposes to
verify compatibility. The check also showed that `vmalert` doesn't support
`query` template function that was mentioned as limitation in README.
The feature allows to speed up group rules execution by
executing them concurrently.
Change also contains README changes to reflect configuration
details.
* vmalert: Add recording rules support.
Recording rules support required additional service refactoring since
it wasn't planned to support them from the very beginning. The list
of changes is following:
* new entity RecordingRule was added for writing results of MetricsQL
expressions into remote storage;
* interface Rule now unites both recording and alerting rules;
* configuration parser was moved to separate package and now performs
more strict validation;
* new endpoint for listing all groups and rules in json format was added;
* evaluation interval may be set to every particular group;
* vmalert: uncomment tests
* vmalert: rm outdated TODO
* vmalert: fix typos in README
v1.36.0 always returns empty responses for Graphite wildcards like the following
{__name__=~"foo\\.[^.]*\\.bar\\.baz"}
Temporary workaround for v1.36.0 is to add `[^.]*` to the end of the regexp.
Add index for reverse Graphite-like metric names with dots. Use this index during search for filters
like `__name__=~"foo\\.[^.]*\\.bar\\.baz"` which end with non-empty suffix with dots, i.e. `.bar.baz` in this case.
This change may "hide" historical time series during queries. The workaround is to add `[.]*` to the end of regexp label filter,
i.e. "foo\\.[^.]*\\.bar\\.baz" should be substituted with "foo\\.[^.]*\\.bar\\.baz[.]*".
* docs/Single-server-VictoriaMetrics.md: add link to Wiki page so it may get more attention
* docs/Single-server-VictoriaMetrics.md: mention case for changing `-retentionPeriod` setting
* Slow metrics load panel was removed since it is hard to interpret without
additional metrics and stats;
* Slow inserts panel was updated to display percentage of slow inserts comparing
to total number of inserts to show the real impact.
The new update introduces new row "Troubleshooting" that
contains panels for churn rate and slow-queries/inserts/loads metrics. This row supposed to be reveal the cause of low performance or other issues.
Panels for storage were updated with "bytes-per-datapoint" and "remaining disk size" panels.
Before the change we were sending notifications to notifier
if following conditions are met:
* alert is in Fire state
* alert is in Inactive state
We were sending Inactive notifications to resolve alert ASAP.
Unfortunately, we were sending resolves for Pending alerts that become
Inactive, which is wrong.
In this change we delete alert from the active list if
it was Pending and become Inactive. In this way we now
have Inactive alerts only if they were in state Fire before.
See test change for example.
During group's update rules deletion was causing slice
mutations while slice index was assumed to be unchanged.
This caused "slice bounds out of range" errors when multiple
rules were deleted sequentially.
The check for non-nil remoteRead was mistakenly dropped
during refactoring which caused panics when `vmalert`
wasn't configured with `remoteRead` flag.
* feat: add remoteWrite.maxQueueSize to reduce queue full
* rename remote(write|read) flags to remote(Write|Read) for the sake of consistency
Co-authored-by: xiaobeibei <xiaobeibei@bigo.sg>
The change introduces new entity `manager` which replaces
`watchdog`, decouples requestHandler and groups. Manager
supposed to control life cycle of groups, rules and
config reloads.
Groups export an ID method which returns a hash
from filename and group name. ID supposed to be unique
identifier across all loaded groups.
Some tests were added to improve coverage.
Bug with wrong annotation value if $value is used in
templates after metrics being restored fixed.
Notifier interface was extended to accept context.
New set of metrics was introduced for config reload.
The http server returns 503 non-OK error at `/health` page during grace period,
so load balancers in front of the http server could re-route incoming requests
to other servers.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/463
Previously the duration for graceful shutdown for http server could take more than a minute
because of imporperly set timeouts in setNetworkTimeout.
Now typical duration for graceful shutdown should be reduced to less than 5 seconds.
* app/vmalert: restore alerts state from datasource metrics
Vmalert will restore alerts state for rules that have `rule.For` > 0 from previously written timeseries via `remotewrite.url` flag.
* app/vmalert: mention remotewerite and remoteread configuration in README
The upstream fasthttp may contain issues like 996610f021 ,
plus a code that isn't used by VictoriaMetrics. So let's use a private copy under our control instead.
Newly added index entries can be missing after unclean shutdown, since they didn't flush to persistent storage yet.
Log about this and delete the corresponding metricID, so it could be re-created next time.
* app/vmalert: initial remote-write support for alerts state persistence.
If `remotewrite.url` flag is set, vmalert will send alerts state via remote-write protocol to remote storage. The sending is asynchronous to avoid blocking calls in rules evaluation loop.
* app/vmalert: merge with master
* app/vmalert: write both `instant` and `for` alerts timeseries states in remote storage.
This eliminates the need for storing block data into temporary files on a single-node VictoriaMetrics
during heavy queries, which touch big number of time series over long time ranges.
This improves single-node VM performance on heavy queries by up to 2x.
Now it leaves only the first data point on each `-dedup.minScrapeInterval` interval.
Previously it may leave two data points on the interval. This could lead to unexpected results
for `histogram_quantile(phi, sum(rate(buckets)) by (le))` query.
This should reduce the frequency of the following errors:
cannot find tag filter matching less than N time series; either increase -search.maxUniqueTimeseries or use more specific tag filters
more than N time series found on the time range [...]; either increase -search.maxUniqueTimeseries or shrink the time range
This should fix the following error when sending data to remote storage:
couldn't send a block with size XX bytes to "YYY": the server closed connection before returning the first response byte. Make sure the server returns 'Connection: close' response header before closing the connection
This should fix the following error:
the server closed connection before returning the first response byte. Make sure the server returns 'Connection: close' response header before closing the connection
This case is possible when the corresponding metricID->metricName entry didn't propagate to inverted index yet.
This should fix the following error:
error when searching tsids for tfss [...]: cannot find metricName by metricID 1582417212213420669: EOF
This is a common error whith improperly configured target autodiscovery and/or relabeling.
This error leads to duplicate scraping of the same targets with the same set of labels, which leads
to duplicate samples in time series.
Previously targets with duplicate scrape urls were merged into a single line on the page.
Now each target with duplicate scrape url is displayed on a separate line.
This should reduce CPU load during scraping when target discovery generates
big number of `__meta_*` labels (for instance, k8s discovery).
See https://www.robustperception.io/life-of-a-label for details.
vmalert: Extend web responses for alerts
* populate apiAlert object with additional fields
* return all active alerts, not only firing
* sort list of API alerts for deterministic output
* add helper for available path list
This should prevent from cache stats counters going down after cache rotation,
which may corrupt `cache hit ratio` graph on the official Grafan dasbhoards
when using the following query:
1 - (sum(rate(vm_cache_misses_total[5m])) by (type) / sum(rate(vm_cache_requests_total[5m])) by (type))
* Initial rules evaluation support.
Rules are now store alerts state in private field `alerts`. Every evaluation updates
the alerts and state. Every unique metric received from datastore represents a unique alert,
uniqueness is guaranteed by hashing ordered labelset.
* merge with master
* cleanup
* support endAt parameter as 3*evaluationInterval for active alerts
* make golint happy
- Sort tag filters in the ascending number of matching time series
in order to apply the most specific filters first.
- Fall back to metricName search for filters matching big number of time series
(usually this are negative filters or regexp filters).
If docker app is upgraded from root to non-root, then the data pointed by `-storageDataPath` or similar flags
becomes denied to non-root user after the upgrade. This breaks upgrade path. So revert back to default root user
for docker apps.
Users may explicitly execute `docker run --user <non_root_user>` for running docker apps under non-root user.
The way how regex for column style in Table panel should be applied has changed in 6.7 Grafana version. The change supposed to fix Flags panel column styles accordingly.
* Remember the last accessed bucket on Has() call.
* Inline fast paths inside Add() and Has() calls.
* Remove fragile code with maxUnsortedBuckets inside bucket32.
This guarantees that the snapshot contains all the recently added data
from inmemory buffers when multiple concurrent calls to Storage.CreateSnapshot are performed.
The following corner cases now supported:
* label_map(q, "label", "", "foo") - adds `label="foo"` to series with missing `label`
* label_map(q, "label", "foo", "") - removes `label="foo"` from series
All the unmatched labels are kept unchanged.
Prometheus docs at https://prometheus.io/docs/prometheus/latest/configuration/configuration/#scrape_config say:
> In communication with external systems, they are always applied only
> when a time series does not have a given label yet and are ignored otherwise.
Though this may result in consistency chaos when scrape targets override `external_labels`,
let's stick with Prometheus behavior for the sake of backwards compatibility.
There is last resort in vmagent with `-remoteWrite.label`, which consistently
sets the configured labels to all the metrics before sending them to remote storage.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/366
This should prevent from the following unexpected side-effects of idempotent request retries:
- increased actual timeout when scraping the target comparing to the configured scrape_timeout
- increased load on the target
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/357
This flag can be used for removing gaps on graphs if the difference between the current time
and the timestamps from the ingested data exceeds 5 minutes.
This is the case when the time between data sources and VictoriaMetrics is improperly synchronized.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/312
This should reduce CPU usage when adding new entries to inverted index.
This should alos prevent from creating stalled cleaner goroutines for the created temporary parts,
since they were never closed.
This should fix the following issue: https://github.com/VictoriaMetrics/VictoriaMetrics/issues/316 .
A single heavy request can saturate all the available CPUs, so let's limit the number of concurrent requests to lower value.
This will give more chances for executing insert path.
This makes the behaviour for these functions similar to Prometheus when processing broken time series with irregular data points
like `gitlab_runner_jobs`. See https://gitlab.com/gitlab-org/gitlab-exporter/issues/50 for details.
This function can be used for simultaneous calculating of multiple `aggr_func*` functions
that accept range vector. For example, `aggr_over_time(("min_over_time", "max_over_time"), m[d])`
would calculate `min_over_time` and `max_over_time` for `m[d]`.
This should prevent from restoring from incomplete backups.
Add `-skipBackupCompleteCheck` command-line flag to `vmrestore` in order to be able restoring from old backups without `backup complete` file.
This should prevent from the following panic at app/vmselect/promql/binary_op.go:255:
BUG: len(leftVaues) must match len(rightValues) and len(dstValues)
The full list of functions added:
- `topk_min(k, q)` - returns top K time series with the max minimums on the given time range
- `topk_max(k, q)` - returns top K time series with the max maximums on the given time range
- `topk_avg(k, q)` - returns top K time series with the max averages on the given time range
- `topk_median(k, q)` - returns top K time series with the max medians on the given time range
- `bottomk_min(k, q)` - returns bottom K time series with the min minimums on the given time range
- `bottomk_max(k, q)` - returns bottom K time series with the min maximums on the given time range
- `bottomk_avg(k, q)` - returns bottom K time series with the min averages on the given time range
- `bottomk_median(k, q)` - returns bottom K time series with the min medians on the given time range
`runTransactions` call issues async deletions for transaction files. The previously issued transaction deletions
can race with the next call to `runTransactions`. Prevent this by waiting until all the pending transaction
deletions are funished in the beginning of `runTransactions`. Also make sure that all the pending transaction
deletions are finished before returning from `runTransactions`.
This should fix the issue on NFS when incompletely removed dirs may be left
after unclean shutdown (OOM, kill -9, hard reset, etc.), while the corresponding transaction
files are already removed.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/162
The metricID->metricName entry can be missing in the indexdb after unclean shutdown
when only a part of entries for new time series is written into indexdb.
Recover from such a situation by removing the broken metricID. New metricID
will be automatically created for time series with the given metricName
when new data point will arive to it.
This fixes an issue when VictoriaMetrics doesn't see the restored data after the following operations:
1. Stop VictoriaMetrics.
2. Delete `<-storageDataPath>` dir.
3. Start VictoriaMetrics, then stop it.
4. Restore data from backup with `vmrestore`.
5. Start VictoriaMetrics.
`vmrestore` didn't delete properly empty dirs in `<-storageDataPath>/indexdb` because of the remaining `flock.lock` files in these dirs.
`start` arg has higher chances to be set properly comparing to `end` arg,
so it is expected that the `end` arg could be adjusted if it was set incorrectly.
See the corresponding benchmark in Prometheus - 23c0299d85/tsdb/head_bench_test.go (L52)
The benchmark allows performing apples-to-apples comparison of time series search
in Prometheus and VictoriaMetrics. The following article - https://www.robustperception.io/evaluating-performance-and-correctness -
contains incorrect numbers for VictoriaMetrics, since there wasn't this benchmark yet. Fix this.
Benchmarks can be repeated with the following commands from Prometheus and VictoriaMetrics source code roots:
- Prometheus: GOMAXPROCS=1 go test ./tsdb/ -run=111 -bench=BenchmarkHeadPostingForMatchers
- VictoriaMetrics: GOMAXPROCS=1 go test ./lib/storage/ -run=111 -bench=BenchmarkHeadPostingForMatchers
Benchmark results:
benchmark old ns/op new ns/op delta
BenchmarkHeadPostingForMatchers/n="1" 272756688 364977 -99.87%
BenchmarkHeadPostingForMatchers/n="1",j="foo" 138132923 1181636 -99.14%
BenchmarkHeadPostingForMatchers/j="foo",n="1" 134723762 1141578 -99.15%
BenchmarkHeadPostingForMatchers/n="1",j!="foo" 195823953 1148056 -99.41%
BenchmarkHeadPostingForMatchers/i=~".*" 7962582919 8716755 -99.89%
BenchmarkHeadPostingForMatchers/i=~".+" 7589543864 12096587 -99.84%
BenchmarkHeadPostingForMatchers/i=~"" 1142371741 16164560 -98.59%
BenchmarkHeadPostingForMatchers/i!="" 9964150263 12230021 -99.88%
BenchmarkHeadPostingForMatchers/n="1",i=~".*",j="foo" 216995884 1173476 -99.46%
BenchmarkHeadPostingForMatchers/n="1",i=~".*",i!="2",j="foo" 202541348 1299743 -99.36%
BenchmarkHeadPostingForMatchers/n="1",i!="" 486285711 11555193 -97.62%
BenchmarkHeadPostingForMatchers/n="1",i!="",j="foo" 350776931 5607506 -98.40%
BenchmarkHeadPostingForMatchers/n="1",i=~".+",j="foo" 380888565 6380335 -98.32%
BenchmarkHeadPostingForMatchers/n="1",i=~"1.+",j="foo" 89500296 2078970 -97.68%
BenchmarkHeadPostingForMatchers/n="1",i=~".+",i!="2",j="foo" 379529654 6561368 -98.27%
BenchmarkHeadPostingForMatchers/n="1",i=~".+",i!~"2.*",j="foo" 424563825 6757132 -98.41%
The first column (old) is for Prometheus, the second column (new) is for VictoriaMetrics.
As you can see, VictoriaMetrics outperforms Prometheus by more than 100x in almost all the test cases of this benchmark.
Prometheus was using 3.5GB of RAM during the benchmark, while VictoriaMetrics was using 400MB of RAM.
The current day could miss entries for already stopped time series before
enabling per-day index.
This fixes the issue when queries return empty results during the first hour after
upgrading to v1.29.*
Production workload shows that the index requires ~4Kb of RAM per active time series.
This is too much for high number of active time series, so let's delete this index.
Now the queries should fall back to the index for the current day instead of the index
for the recent hour. The query performance for the current day index should be good enough
given the 100M rows/sec scan speed per CPU core.
Issues fixed:
- Slow startup times. Now the index is loaded from cache during start.
- High memory usage related to superflouos index copies every 10 seconds.
Production load with >10M active time series showed it could
slow down VictoriaMetrics startup times and could eat
all the memory leading to OOM.
Remove inmemory inverted index for recent hours until thorough
testing on production data shows it works OK.
Continue trying to remove NFS directory on temporary errors for up to a minute.
The previous async removal process breaks in the following case during VictoriaMetrics start
- VictoriaMetrics opens index, finds incomplete merge transactions and starts replaying them.
- The transaction instructs removing old directories for parts, which were already merged into bigger part.
- VictoriaMetrics removes these directories, but their removal is delayed due to NFS errors.
- VictoriaMetrics scans partition directory after all the incomplete merge transactions are finished
and finds directories, which should be removed, but weren't still removed due to NFS errors.
- VictoriaMetrics panics when it finds unexpected empty directory.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/162
Incremental aggregate functions don't keep all the selected time series in memory -
they keep only up to GOMAXPROCS time series for incremental aggregations.
Take into account that the number of time series in RAM can be higher if they are split
into many groups with `by (...)` or `without (...)` modifiers.
This should reduce the number of `not enough memory for processing ... data points` false
positive errors.
The origin of the error has been detected and documented in the code,
so it is enough to export a counter for such errors at `vm_index_blocks_with_metric_ids_incorrect_order_total`,
so it could be monitored and alerted on high error rates.
Export also the counter for processed index blocks with metricIDs - `vm_index_blocks_with_metric_ids_processed_total`,
so its' rate could be compared to `rate(vm_index_blocks_with_metric_ids_incorrect_order_total)`.
Slow loops could require seeks and expensive regexp matching, while fast loops just scans
all the metricIDs for the given `tag=value` prefix. So these operations must have separate
max loops multiplier.
The fastest tag filters are non-negative non-regexp, since they are the most specific.
The slowest tag filters are negative regexp, since they require scanning
all the entries for the given label.
This flag is similar to `-search.lookback-delta` if set. The max lookback interval is determined dynamically
from interval between datapoints for each input time series if the flag isn't set.
The interval can be overriden on per-query basis by passing `max_lookback=<duration>` query arg to `/api/v1/query` and `/api/v1/query_range`.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/209
`increases_over_time(q[d])` returns the number of `q` increases during the given duration `d`.
`decreases_over_time(q[d])` returns the number of `q` decreases during the given duration `d`.
This should improve inverted index search performance for filters matching big number of time series,
since `lib/uint64set.Set` is faster than `map[uint64]struct{}` for both `Add` and `Has` calls.
See the corresponding benchmarks in `lib/uint64set`.
This should improve lookup performance if the same `label=value` pair exists
in big number of time series.
This should also reduce memory usage for mergeset data cache, since `tag->metricIDs` rows
occupy less space than the original `tag->metricID` rows.
The number of MetricID->TSID entries is smaller than the number of tag->MetricID entries
and MetricID->TSID entries are usually shorter than tag->MetricID entries.
This should improve performance when selecting all the metricIDs.
The follosing issues were fixed:
- VictoriaMetrics could leave superflouos labels when using `on` or `ignoring` modifiers
- VictoriaMetrics could return `duplicate timeseries` error when using `group_left` or `group_right` with non-empty label list
Do not give up after 11 attempts of directory removal on laggy NFS.
Add `vm_nfs_dir_remove_failed_attempts_total` metric for counting the number of failed attempts
on directory removal.
Log failed attempts on directory removal after long sleep times.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/162
The maximum inmemory file size now depends on `-memory.allowedPercent`.
This should improve performance and reduce the number of filesystem calls
on machines with big amounts of RAM when performing heavy queries
over big number of samples and time series.
The commend became outdated after the commit ed6ac1a5df027f0dfc22448e3b27c26b6f77c67a,
which stops merging of small parts on graceful shutdown instead of waiting
for their completion.
This function is useful for determining time series lifetime.
`d` must exceed the expected lifetime of the time series, otherwise
the function would return values close to `d`.
This should reduce the amount of RAM required for processing time series
with non-zero churn rate.
The previous cache behavior can be restored with `-cache.oldBehavior` command-line flag.
Track also the number of dropped rows due to the exceeded timeout
on concurrency limit for Storage.AddRows. This number is tracked in `vm_concurrent_addrows_dropped_rows_total`
Real-world production data shows higher jitter than 1/8 of scrape interval.
This may results in gaps on the graph. So increase the allowed jitter to 1/4
of scrape interval in order to reduce the probability of gaps on the graphs
over time series with high jitter for scrape_interval.
Default rollup function is `last_over_time`. It must support adjusting
the provided window in order to prevent from gaps on the graph
for window values smaller than scrape interval.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/134
- Use `-prod` binaries instead of development binaries for both deb and rpm packages.
- Fix binary directory from /usr/sbin to /usr/local/bin as outlined in package/victoria-metrics.service
- Fix binary name from `victoriametrics` to `victoria-metrics-prod` in package/victoria-metrics.service
* initial commit of deb packaging
* Incorporated feedback from @valyala:
- Put data directory under /var/lib
- More beef in systemd file
- Packaging for arm64
- Package all target which builds and packages both amd64 and arm64
* Remove PIDFile from systemd unit, useless
* per PR feedback, move debian specific files into deb subdirectory
Updates #107 .
- Series with negative values are always gauges
- Counters may only have increasing values with possible counter resets
This should improve compression ratio for gauge series which
were previously mistakenly detected as counters.
Calculate incremental aggregates for `aggr(metric_selector)` function instead of
keeping all the time series matching the given `metric_selector` in memory.
- Do not overwrite last points by the previous NaNs, since this may result in empty time series.
- Overwrite the last 2 points instead of 3. This should be enough in most cases.
Data in ExtDB cannot be changed, so it is OK to use unversioned keys for tag cache.
This should improve performance for index lookups over big amount of time series.
`deriv_fast` calculates derivative based on the first and the last point on the interval
instead of calculating linear regression based on all the data points on the interval.
Updates https://github.com/VictoriaMetrics/VictoriaMetrics/issues/73
Prometheus establishes a connection per shard in remote_write config.
By default it establishes up to 1000 connections to remote storage (max_shards: 1000).
This is quite big, so set `max_shards: 100` in the recommmended Prometheus config.
This should improve performance after restart when the db contains a lot of time series
with high time series churn (i.e. metrics from Kubernetes with many pods and frequent deployments)
Unremoved directories may lead to inconsistent data directory,
so VictoriaMetrics will fail to start next time.
So panic on the first error when trying to remove directory in order
to simplify recover process.
This should improve speed for searching metrics among high number of time series
with high churn rate like in big Kubernetes clusters with frequent deployments.
ReadLinesBlock may accept dstBuf with non-zero length. In this case the last line without trailing newline isn't read.
Fix this by comparing len(dstBuf) to 0 instead of its original length.
Add sigint as stopsignal to docker file. You can find more here: https://docs.docker.com/engine/reference/builder/#usage
With this change, the main process inside the container will receive SIGINT, and after a grace period, SIGKILL.
Previously arbitrary time series could be returned from `quantile`
depending on sort order for the last data point in the selected range.
Fix this by returning the calculated time series.
Fixes https://github.com/VictoriaMetrics/VictoriaMetrics/issues/55
- Increase update iterval from 1s to 10s. This should reduce CPU usage
for large amounts of metric ids with constant churn.
- Reduce pendingTodayMetricIDsLock lock duration during the update.
Please provide a brief description of the changes you made. Be as specific as possible to help others understand the purpose and impact of your modifications.
### Checklist
The following checks are **mandatory**:
- [ ] My change adheres [VictoriaMetrics contributing guidelines](https://docs.victoriametrics.com/contributing/).
Cluster version is available [here](https://github.com/VictoriaMetrics/VictoriaMetrics/tree/cluster).
VictoriaMetrics is a fast, cost-saving, and scalable solution for monitoring and managing time series data. It delivers high performance and reliability, making it an ideal choice for businesses of all sizes.
Here are some resources and information about VictoriaMetrics:
Yes, we open-source both the single-node VictoriaMetrics and the cluster version.
## Prominent features
* Supports [Prometheus querying API](https://prometheus.io/docs/prometheus/latest/querying/api/), so it can be used as Prometheus drop-in replacement in Grafana.
Additionally, VictoriaMetrics extends PromQL with opt-in [useful features](https://github.com/VictoriaMetrics/VictoriaMetrics/wiki/ExtendedPromQL).
*High performance and good scalability for both [inserts](https://medium.com/@valyala/high-cardinality-tsdb-benchmarks-victoriametrics-vs-timescaledb-vs-influxdb-13e6ee64dd6b)
and [selects](https://medium.com/@valyala/when-size-matters-benchmarking-victoriametrics-vs-timescale-and-influxdb-6035811952d4).
[Outperforms InfluxDB and TimescaleDB by up to 20x](https://medium.com/@valyala/measuring-vertical-scalability-for-time-series-databases-in-google-cloud-92550d78d8ae).
*[Uses 10x less RAM than InfluxDB](https://medium.com/@valyala/insert-benchmarks-with-inch-influxdb-vs-victoriametrics-e31a41ae2893) when working with millions of unique time series (aka high cardinality).
*High data compression, so [up to 70x more data points](https://medium.com/@valyala/when-size-matters-benchmarking-victoriametrics-vs-timescale-and-influxdb-6035811952d4)
may be crammed into a limited storage comparing to TimescaleDB.
*Optimized for storage with high-latency IO and low iops (HDD and network storage in AWS, Google Cloud, Microsoft Azure, etc). See [graphs from these benchmarks](https://medium.com/@valyala/high-cardinality-tsdb-benchmarks-victoriametrics-vs-timescaledb-vs-influxdb-13e6ee64dd6b).
* A single-node VictoriaMetrics may substitute moderately sized clusters built with competing solutions such as Thanos, Uber M3, Cortex, InfluxDB or TimescaleDB.
See [vertical scalability benchmarks](https://medium.com/@valyala/measuring-vertical-scalability-for-time-series-databases-in-google-cloud-92550d78d8ae).
* Easy operation:
*VictoriaMetrics consists of a single executable without external dependencies.
*All the configuration is done via explicit command-line flags with reasonable defaults.
*All the data is stored in a single directory pointed by `-storageDataPath` flag.
*Easy backups from [instant snapshots](https://medium.com/@valyala/how-victoriametrics-makes-instant-snapshots-for-multi-terabyte-time-series-data-e1f3fb0e0282).
* Storage is protected from corruption on unclean shutdown (i.e. hardware reset or `kill -9`) thanks to [the storage architecture](https://medium.com/@valyala/how-victoriametrics-makes-instant-snapshots-for-multi-terabyte-time-series-data-e1f3fb0e0282).
* Supports metrics' ingestion and backfilling via the following protocols:
* [InfluxDB line protocol](https://docs.influxdata.com/influxdb/v1.7/write_protocols/line_protocol_tutorial/)
*[Graphite plaintext protocol](https://graphite.readthedocs.io/en/latest/feeding-carbon.html) with [tags](https://graphite.readthedocs.io/en/latest/tags.html#carbon)
if `-graphiteListenAddr` is set.
* [OpenTSDB put message](http://opentsdb.net/docs/build/html/api_telnet/put.html) if `-opentsdbListenAddr` is set.
* Ideally works with big amounts of time series data from IoT sensors, connected car sensors and industrial sensors.
* Has open source [cluster version](https://github.com/VictoriaMetrics/VictoriaMetrics/tree/cluster).
## Operation
### Table of contents
*[How to build from sources](#how-to-build-from-sources)
*[How to start VictoriaMetrics](#how-to-start-victoriametrics)
* [Prometheus setup](#prometheus-setup)
* [Grafana setup](#grafana-setup)
* [How to send data from InfluxDB-compatible agents such as Telegraf](#how-to-send-data-from-influxdb-compatible-agents-such-as-telegraf)
* [How to send data from Graphite-compatible agents such as StatsD](#how-to-send-data-from-graphite-compatible-agents-such-as-statsd)
* [How to send data from OpenTSDB-compatible agents](#how-to-send-data-from-opentsdb-compatible-agents)
* [How to apply new config / ugrade VictoriaMetrics](#how-to-apply-new-config--upgrade-victoriametrics)
* [How to work with snapshots](#how-to-work-with-snapshots)
* [How to delete time series](#how-to-delete-time-series)
* [How to export time series](#how-to-export-time-series)
*[Federation](#federation)
*[Capacity planning](#capacity-planning)
*[High Availability](#high-availability)
*[Multiple retentions](#multiple-retentions)
*[Scalability and cluster version](#scalability-and-cluster-version)
*[Security](#security)
* [Tuning](#tuning)
* [Monitoring](#monitoring)
* [Troubleshooting](#troubleshooting)
* [Community and contributions](#community-and-contributions)
* [Reporting bugs](#reporting-bugs)
### How to build from sources
We recommend using either [binary releases](https://github.com/VictoriaMetrics/VictoriaMetrics/releases) or
[docker images](https://hub.docker.com/r/valyala/victoria-metrics/) instead of building VictoriaMetrics
from sources. Building from sources is reasonable when developing an additional features specific
to your needs.
#### Development build
1. [Install Go](https://golang.org/doc/install). The minimum supported version is Go 1.12.
2. Run `go build ./app/victoria-metrics` from the root folder of the repository.
It will build `victoria-metrics` binary in the root folder of the repository.
2) Send data to the given address from OpenTSDB-compatible agents.
### How to apply new config / upgrade VictoriaMetrics?
VictoriaMetrics must be restarted in order to upgrade or apply new config:
1) Send `SIGINT` signal to VictoriaMetrics process in order to gracefully stop it.
2) Wait until the process stops. This can take a few seconds.
3) Start the upgraded VictoriaMetrics with new config.
### How to work with snapshots?
Navigate to `http://<victoriametrics-addr>:8428/snapshot/create` in order to create an instant snapshot.
The page will return the following JSON response:
```
{"status":"ok","snapshot":"<snapshot-name>"}
```
Snapshots are created under `<-storageDataPath>/snapshots` directory, where `<-storageDataPath>`
is the command-line flag value. Snapshots can be archived to backup storage via `rsync -L`, `scp -r`
or any similar tool that follows symlinks during copying.
The `http://<victoriametrics-addr>:8428/snapshot/list` page contains the list of available snapshots.
Navigate to `http://<victoriametrics-addr>:8428/snapshot/delete?snapshot=<snapshot-name>` in order
to delete `<snapshot-name>` snapshot.
Navigate to `http://<victoriametrics-addr>:8428/snapshot/delete_all` in order to delete all the snapshots.
### How to delete time series?
Send a request to `http://<victoriametrics-addr>:8428/api/v1/admin/tsdb/delete_series?match[]=<timeseries_selector_for_delete>`,
where `<timeseries_selector_for_delete>` may contain any [time series selector](https://prometheus.io/docs/prometheus/latest/querying/basics/#time-series-selectors)
for metrics to delete. After that all the time series matching the given selector are deleted. Storage space for
the deleted time series isn't freed instantly - it is freed during subsequent merges of data files.
### How to export time series?
Send a request to `http://<victoriametrics-addr>:8428/api/v1/export?match[]=<timeseries_selector_for_export>`,
where `<timeseries_selector_for_export>` may contain any [time series selector](https://prometheus.io/docs/prometheus/latest/querying/basics/#time-series-selectors)
for metrics to export. The response would contain all the data for the selected time series in [JSON streaming format](https://en.wikipedia.org/wiki/JSON_streaming#Line-delimited_JSON).
Each JSON line would contain data for a single time series. An example output:
at `http://<victoriametrics-addr>:8428/federate?match[]=<timeseries_selector_for_federation>`.
Optional `start` and `end` args may be added to the request in order to scrape the last point for each selected time series on the `[start ... end]` interval.
`start` and `end` may contain either unix timestamp in seconds or [RFC3339](https://www.ietf.org/rfc/rfc3339.txt) values. By default the last point
on the interval `[now - max_lookback ... now]` is scraped for each time series. Default value for `max_lookback` is `5m` (5 minutes), but can be overriden.
For instance, `/federate?match[]=up&max_lookback=1h` would return last points on the `[now - 1h ... now]` interval. This may be useful for time series federation
with scrape intervals exceeding `5m`.
### Capacity planning
Rough estimation of the required resources:
* RAM size: less than 1KB per active time series. So, ~1GB of RAM is required for 1M active time series.
Time series is considered active if new data points have been added to it recently or if it has been recently queried.
VictoriaMetrics stores various caches in RAM. Memory size for these caches may be limited with `-memory.allowedPercent` flag.
* CPU cores: a CPU core per 300K inserted data points per second. So, ~4 CPU cores are required for processing
the insert stream of 1M data points per second.
If you see lower numbers per CPU core, then it is likely active time series info doesn't fit caches,
so you need more RAM for lowering CPU usage.
* Storage size: less than a byte per data point on average. So, ~260GB is required for storing a month-long insert stream
of 100K data points per second.
The actual storage size heavily depends on data randomness (entropy). Higher randomness means higher storage size requirements.
### High availability
1) Install multiple VictoriaMetrics instances in distinct datacenters.
2) Add addresses of these instances to `remote_write` section in Prometheus config:
4) Now Prometheus should write data into all the configured `remote_write` urls in parallel.
5) Set up [Promxy](https://github.com/jacksontj/promxy) in front of all the VictoriaMetrics replicas.
6) Set up Prometheus datasource in Grafana that points to Promxy.
### Multiple retentions
Just start multiple VictoriaMetrics instances with distinct values for the following flags:
*`-retentionPeriod`
*`-storageDataPath`, so the data for each retention period is saved in a separate directory
*`-httpListenAddr`, so clients may reach VictoriaMetrics instance with proper retention
### Scalability and cluster version
Though single-node VictoriaMetrics cannot scale to multiple nodes, it is optimized for resource usage - storage size / bandwidth / IOPS, RAM, CPU.
This means that a single-node VictoriaMetrics may scale vertically and substitute moderately sized cluster built with competing solutions
such as Thanos, Uber M3, InfluxDB or TimescaleDB.
So try single-node VictoriaMetrics at first and then [switch to cluster version](https://github.com/VictoriaMetrics/VictoriaMetrics/tree/cluster) if you still need
horizontally scalable long-term remote storage for really large Prometheus deployments.
[Contact us](mailto:info@victoriametrics.com) for paid support.
### Security
Do not forget protecting sensitive endpoints in VictoriaMetrics when exposing it to untrusted networks such as internet.
Consider setting the following command-line flags:
*`-tls`, `-tlsCertFile` and `-tlsKeyFile` for switching from HTTP to HTTPS.
*`-httpAuth.username` and `-httpAuth.password` for protecting all the HTTP endpoints
with [HTTP Basic Authentication](https://en.wikipedia.org/wiki/Basic_access_authentication).
*`-deleteAuthKey` for protecting `/api/v1/admin/tsdb/delete_series` endpoint. See [how to delete time series](#how-to-delete-time-series).
*`-snapshotAuthKey` for protecting `/snapshot*` endpoints. See [how to work with snapshots](#how-to-work-with-snapshots).
Explicitly set internal network interface for TCP and UDP ports for data ingestion with Graphite and OpenTSDB formats.
For example, substitute `-graphiteListenAddr=:2003` with `-graphiteListenAddr=<internal_iface_ip>:2003`.
### Tuning
* There is no need in VictoriaMetrics tuning, since it uses reasonable defaults for command-line flags,
which are automatically adjusted for the available CPU and RAM resources.
* There is no need in Operating System tuning, since VictoriaMetrics is optimized for default OS settings.
The only option is increasing the limit on [the number open files in the OS](https://medium.com/@muhammadtriwibowo/set-permanently-ulimit-n-open-files-in-ubuntu-4d61064429a),
so Prometheus instances could establish more connections to VictoriaMetrics.
### Monitoring
VictoriaMetrics exports internal metrics in Prometheus format on the `/metrics` page.
Add this page to Prometheus' scrape config in order to collect VictoriaMetrics metrics.
There is [an official Grafana dashboard for single-node VictoriaMetrics](https://grafana.com/dashboards/10229).
### Troubleshooting
* If VictoriaMetrics works slowly and eats more than a CPU core per 100K ingested data points per second,
then it is likely you have too many active time series for the current amount of RAM.
It is recommended increasing the amount of RAM on the node with VictoriaMetrics in order to improve
ingestion performance.
Another option is to increase `-memory.allowedPercent` command-line flag value. Be careful with this
option, since too big value for `-memory.allowedPercent` may result in high I/O usage.
VictoriaMetrics is optimized for timeseries data, even when old time series are constantly replaced by new ones at a high rate, it offers a lot of features:
***Long-term storage for Prometheus** or as a drop-in replacement for Prometheus and Graphite in Grafana.
* **Powerful stream aggregation**: Can be used as a StatsD alternative.
* **Ideal for big data**: Works well with large amounts of time series data from APM, Kubernetes, IoT sensors, connected cars, industrial telemetry, financial data and various [Enterprise workloads](https://docs.victoriametrics.com/enterprise/).
***Query language**: Supports both PromQL and the more performant MetricsQL.
***Easy to setup**: No dependencies, single [small binary](https://medium.com/@valyala/stripping-dependency-bloat-in-victoriametrics-docker-image-983fb5912b0d), configuration through command-line flags, but the default is also fine-tuned; backup and restore with [instant snapshots](https://medium.com/@valyala/how-victoriametrics-makes-instant-snapshots-for-multi-terabyte-time-series-data-e1f3fb0e0282).
* **Global query view**: Multiple Prometheus instances or any other data sources may ingest data into VictoriaMetrics and queried via a single query.
***Various Protocols**: Support metric scraping, ingestion and backfilling in various protocol.
- **Multiple retentions**: Reducing storage costs by specifying different retentions for different datasets.
- **Downsampling**: Reducing storage costs and increasing performance for queries over historical data.
- **Stable releases** with long-term support lines ([LTS](https://docs.victoriametrics.com/lts-releases/)).
-**Comprehensive support**: First-class consulting, feature requests and technical support provided by the core VictoriaMetrics dev team.
-Many other features, which you can read about on [the Enterprise page](https://docs.victoriametrics.com/enterprise/).
[Contact us](mailto:info@victoriametrics.com) if you need enterprise support for VictoriaMetrics. Or you can request a free trial license [here](https://victoriametrics.com/products/enterprise/trial/), downloaded Enterprise binaries are available at [Github Releases](https://github.com/VictoriaMetrics/VictoriaMetrics/releases/latest).
We strictly apply security measures in everything we do. VictoriaMetrics has achieved security certifications for Database Software Development and Software-Based Monitoring Services. See [Security page](https://victoriametrics.com/security/) for more details.
## Benchmarks
Some good benchmarks VictoriaMetrics achieved:
***Minimal memory footprint**: handling millions of unique timeseries with [10x less RAM](https://medium.com/@valyala/insert-benchmarks-with-inch-influxdb-vs-victoriametrics-e31a41ae2893) than InfluxDB, up to [7x less RAM](https://valyala.medium.com/prometheus-vs-victoriametrics-benchmark-on-node-exporter-metrics-4ca29c75590f) than Prometheus, Thanos or Cortex.
***Highly scalable and performance** for [data ingestion](https://medium.com/@valyala/high-cardinality-tsdb-benchmarks-victoriametrics-vs-timescaledb-vs-influxdb-13e6ee64dd6b) and [querying](https://medium.com/@valyala/when-size-matters-benchmarking-victoriametrics-vs-timescale-and-influxdb-6035811952d4), [20x outperforms](https://medium.com/@valyala/insert-benchmarks-with-inch-influxdb-vs-victoriametrics-e31a41ae2893) InfluxDB and TimescaleDB.
***High data compression**: [70x more data points](https://medium.com/@valyala/when-size-matters-benchmarking-victoriametrics-vs-timescale-and-influxdb-6035811952d4) may be stored into limited storage than TimescaleDB, [7x less storage](https://valyala.medium.com/prometheus-vs-victoriametrics-benchmark-on-node-exporter-metrics-4ca29c75590f) space is required than Prometheus, Thanos or Cortex.
***Reducing storage costs**: [10x more effective](https://docs.victoriametrics.com/casestudies/#grammarly) than Graphite according to the Grammarly case study.
***A single-node VictoriaMetrics** can replace medium-sized clusters built with competing solutions such as Thanos, M3DB, Cortex, InfluxDB or TimescaleDB. See [VictoriaMetrics vs Thanos](https://medium.com/@valyala/comparing-thanos-to-victoriametrics-cluster-b193bea1683), [Measuring vertical scalability](https://medium.com/@valyala/measuring-vertical-scalability-for-time-series-databases-in-google-cloud-92550d78d8ae), [Remote write storage wars - PromCon 2019](https://promcon.io/2019-munich/talks/remote-write-storage-wars/).
***Optimized for storage**: [Works well with high-latency IO](https://medium.com/@valyala/high-cardinality-tsdb-benchmarks-victoriametrics-vs-timescaledb-vs-influxdb-13e6ee64dd6b) and low IOPS (HDD and network storage in AWS, Google Cloud, Microsoft Azure, etc.).
## Community and contributions
Feel free asking any questions regarding VictoriaMetrics [here](https://groups.google.com/forum/#!forum/victorametrics-users).
Feel free asking any questions regarding VictoriaMetrics:
We are open to third-party pull requests provided they follow [KISS design principle](https://en.wikipedia.org/wiki/KISS_principle):
* [Slack Inviter](https://slack.victoriametrics.com/) and [Slack channel](https://victoriametrics.slack.com/)
- Minimize the number of moving parts in the distributed system.
- Avoid automated decisions, which may hurt cluster availability, consistency or performance.
If you like VictoriaMetrics and want to contribute, then please [read these docs](https://docs.victoriametrics.com/contributing/).
Adhering `KISS` principle simplifies the resulting code and architecture, so it can be reviewed, understood and verified by many people.
## VictoriaMetrics Logo
## Reporting bugs
Report bugs and propose new features [here](https://github.com/VictoriaMetrics/VictoriaMetrics/issues).
## Victoria Metrics Logo
[Zip](VM_logo.zip) contains three folders with different image orientation (main color and inverted version).
[Zip](https://github.com/VictoriaMetrics/VictoriaMetrics/blob/master/VM_logo.zip) contains three folders with different image orientations (main color and inverted version).
Files included in each folder:
@@ -408,22 +103,22 @@ Files included in each folder:
httpListenAddrs=flagutil.NewArrayString("httpListenAddr","TCP address to listen for incoming http requests. See also -httpListenAddr.useProxyProtocol")
useProxyProtocol=flagutil.NewArrayBool("httpListenAddr.useProxyProtocol","Whether to use proxy protocol for connections accepted at the given -httpListenAddr . "+
"See https://www.haproxy.org/download/1.8/doc/proxy-protocol.txt . "+
"With enabled proxy protocol http server cannot serve regular /metrics endpoint. Use -pushmetrics.url for metrics pushing")
)
funcmain(){
// Write flags and help message to stdout, since it is easier to grep or pipe.
flag.CommandLine.SetOutput(os.Stdout)
flag.Usage=usage
envflag.Parse()
buildinfo.Init()
logger.Init()
listenAddrs:=*httpListenAddrs
iflen(listenAddrs)==0{
listenAddrs=[]string{":9428"}
}
logger.Infof("starting VictoriaLogs at %q...",listenAddrs)
varhttpListenAddr=flag.String("httpListenAddr",":8428","TCP address to listen for http connections")
var(
httpListenAddrs=flagutil.NewArrayString("httpListenAddr","TCP addresses to listen for incoming http requests. See also -tls and -httpListenAddr.useProxyProtocol")
useProxyProtocol=flagutil.NewArrayBool("httpListenAddr.useProxyProtocol","Whether to use proxy protocol for connections accepted at the corresponding -httpListenAddr . "+
"See https://www.haproxy.org/download/1.8/doc/proxy-protocol.txt . "+
"With enabled proxy protocol http server cannot serve regular /metrics endpoint. Use -pushmetrics.url for metrics pushing")
minScrapeInterval=flag.Duration("dedup.minScrapeInterval",0,"Leave only the last sample in every time series per each discrete interval "+
"equal to -dedup.minScrapeInterval > 0. See also -streamAggr.dedupInterval and https://docs.victoriametrics.com/#deduplication")
dryRun=flag.Bool("dryRun",false,"Whether to check config files without running VictoriaMetrics. The following config files are checked: "+
"-promscrape.config, -relabelConfig and -streamAggr.config. Unknown config entries aren't allowed in -promscrape.config by default. "+
"This can be changed with -promscrape.config.strictParse=false command-line flag")
inmemoryDataFlushInterval=flag.Duration("inmemoryDataFlushInterval",5*time.Second,"The interval for guaranteed saving of in-memory data to disk. "+
"The saved data survives unclean shutdowns such as OOM crash, hardware reset, SIGKILL, etc. "+
"Bigger intervals may help increase the lifetime of flash storage with limited write cycles (e.g. Raspberry PI). "+
"Smaller intervals increase disk IO load. Minimum supported value is 1s")
)
funcmain(){
flag.Parse()
// Write flags and help message to stdout, since it is easier to grep or pipe.
flag.CommandLine.SetOutput(os.Stdout)
flag.Usage=usage
envflag.Parse()
buildinfo.Init()
logger.Init()
logger.Infof("starting VictoraMetrics at %q...",*httpListenAddr)
ifpromscrape.IsDryRun(){
*dryRun=true
}
if*dryRun{
iferr:=promscrape.CheckConfig();err!=nil{
logger.Fatalf("error when checking -promscrape.config: %s",err)
t.Fatalf("not expected number of csv lines want=%d\ngot=%d test=%s.%s\n\response=%q",len(data),test.ExpectedResultLinesCount,q,test.Issue,strings.Join(data,"\n"))
syslogTimezone=flag.String("syslog.timezone","Local","Timezone to use when parsing timestamps in RFC3164 syslog messages. Timezone must be a valid IANA Time Zone. "+
"For example: America/New_York, Europe/Berlin, Etc/GMT+3 . See https://docs.victoriametrics.com/victorialogs/data-ingestion/syslog/")
syslogTenantIDTCP=flagutil.NewArrayString("syslog.tenantID.tcp","TenantID for logs ingested via the corresponding -syslog.listenAddr.tcp. "+
"See https://docs.victoriametrics.com/victorialogs/data-ingestion/syslog/")
syslogTenantIDUDP=flagutil.NewArrayString("syslog.tenantID.udp","TenantID for logs ingested via the corresponding -syslog.listenAddr.udp. "+
"See https://docs.victoriametrics.com/victorialogs/data-ingestion/syslog/")
listenAddrTCP=flagutil.NewArrayString("syslog.listenAddr.tcp","Comma-separated list of TCP addresses to listen to for Syslog messages. "+
"See https://docs.victoriametrics.com/victorialogs/data-ingestion/syslog/")
listenAddrUDP=flagutil.NewArrayString("syslog.listenAddr.udp","Comma-separated list of UDP address to listen to for Syslog messages. "+
"See https://docs.victoriametrics.com/victorialogs/data-ingestion/syslog/")
tlsEnable=flagutil.NewArrayBool("syslog.tls","Whether to enable TLS for receiving syslog messages at the corresponding -syslog.listenAddr.tcp. "+
"The corresponding -syslog.tlsCertFile and -syslog.tlsKeyFile must be set if -syslog.tls is set. See https://docs.victoriametrics.com/victorialogs/data-ingestion/syslog/#security")
tlsCertFile=flagutil.NewArrayString("syslog.tlsCertFile","Path to file with TLS certificate for the corresponding -syslog.listenAddr.tcp if the corresponding -syslog.tls is set. "+
"Prefer ECDSA certs instead of RSA certs as RSA certs are slower. The provided certificate file is automatically re-read every second, so it can be dynamically updated. "+
"See https://docs.victoriametrics.com/victorialogs/data-ingestion/syslog/#security")
tlsKeyFile=flagutil.NewArrayString("syslog.tlsKeyFile","Path to file with TLS key for the corresponding -syslog.listenAddr.tcp if the corresponding -syslog.tls is set. "+
"The provided key file is automatically re-read every second, so it can be dynamically updated. "+
"See https://docs.victoriametrics.com/victorialogs/data-ingestion/syslog/#security")
tlsCipherSuites=flagutil.NewArrayString("syslog.tlsCipherSuites","Optional list of TLS cipher suites for -syslog.listenAddr.tcp if -syslog.tls is set. "+
"See the list of supported cipher suites at https://pkg.go.dev/crypto/tls#pkg-constants . "+
"See also https://docs.victoriametrics.com/victorialogs/data-ingestion/syslog/#security")
tlsMinVersion=flag.String("syslog.tlsMinVersion","TLS13","The minimum TLS version to use for -syslog.listenAddr.tcp if -syslog.tls is set. "+
"Supported values: TLS10, TLS11, TLS12, TLS13. "+
"See https://docs.victoriametrics.com/victorialogs/data-ingestion/syslog/#security")
compressMethodTCP=flagutil.NewArrayString("syslog.compressMethod.tcp","Compression method for syslog messages received at the corresponding -syslog.listenAddr.tcp. "+
"Supported values: none, gzip, deflate. See https://docs.victoriametrics.com/victorialogs/data-ingestion/syslog/#compression")
compressMethodUDP=flagutil.NewArrayString("syslog.compressMethod.udp","Compression method for syslog messages received at the corresponding -syslog.listenAddr.udp. "+
"Supported values: none, gzip, deflate. See https://docs.victoriametrics.com/victorialogs/data-ingestion/syslog/#compression")
useLocalTimestampTCP=flagutil.NewArrayBool("syslog.useLocalTimestamp.tcp","Whether to use local timestamp instead of the original timestamp for the ingested syslog messages "+
"at the corresponding -syslog.listenAddr.tcp. See https://docs.victoriametrics.com/victorialogs/data-ingestion/syslog/#log-timestamps")
useLocalTimestampUDP=flagutil.NewArrayBool("syslog.useLocalTimestamp.udp","Whether to use local timestamp instead of the original timestamp for the ingested syslog messages "+
"at the corresponding -syslog.listenAddr.udp. See https://docs.victoriametrics.com/victorialogs/data-ingestion/syslog/#log-timestamps")
)
// MustInit initializes syslog parser at the given -syslog.listenAddr.tcp and -syslog.listenAddr.udp ports
//
// This function must be called after flag.Parse().
//
// MustStop() must be called in order to free up resources occupied by the initialized syslog parser.
funcMustInit(){
ifworkersStopCh!=nil{
logger.Panicf("BUG: MustInit() called twice without MustStop() call")
`<123>1 2023-06-03T17:42:12.345Z mymachine.example.com appname 12345 ID47 [exampleSDID@32473 iut="3" eventSource="Application 123 = ] 56" eventID="11211"] This is a test message with structured data.`,
})
}
funcTestSyslogLineReader_Failure(t*testing.T){
f:=func(datastring){
t.Helper()
r:=bytes.NewBufferString(data)
slr:=getSyslogLineReader(r)
deferputSyslogLineReader(slr)
ifslr.nextLine(){
t.Fatalf("expecting failure to read the first line")
{"priority":"123","facility":"15","severity":"3","format":"rfc5424","timestamp":"","hostname":"mymachine.example.com","app_name":"appname","proc_id":"12345","msg_id":"ID47","exampleSDID@32473.iut":"3","exampleSDID@32473.eventSource":"Application 123 = ] 56","exampleSDID@32473.eventID":"11211","_msg":"This is a test message with structured data."}`
The `run_id` field uniquely identifies every `vlogsgenerator` invocation.
### How to write logs to VictoriaLogs?
The generated logs can be written directly to VictoriaLogs by passing the address of [`/insert/jsonline` endpoint](https://docs.victoriametrics.com/victorialogs/data-ingestion/#json-stream-api)
to `-addr` command-line flag. For example, the following command writes the generated logs to VictoriaLogs running at `localhost`:
`vlogsgenerator` accepts various command-line flags, which can be used for configuring the number and the shape of the generated logs.
These flags can be inspected by running `vlogsgenerator -help`. Below are the most interesting flags:
*`-start` - starting timestamp for generating logs. Logs are evenly generated on the [`-start` ... `-end`] interval.
*`-end` - ending timestamp for generating logs. Logs are evenly generated on the [`-start` ... `-end`] interval.
*`-activeStreams` - the number of active [log streams](https://docs.victoriametrics.com/victorialogs/keyconcepts/#stream-fields) to generate.
*`-logsPerStream` - the number of log entries to generate per each log stream. Log entries are evenly distributed on the [`-start` ... `-end`] interval.
The total number of generated logs can be calculated as `-activeStreams` * `-logsPerStream`.
For example, the following command generates `1_000_000` log entries on the time range `[2024-01-01 - 2024-02-01]` across `100`
[log streams](https://docs.victoriametrics.com/victorialogs/keyconcepts/#stream-fields), where every logs stream contains `10_000` log entries,
and writes them to `http://localhost:9428/insert/jsonline`:
```
bin/vlogsgenerator \
-start=2024-01-01 -end=2024-02-01 \
-activeStreams=100 \
-logsPerStream=10_000 \
-addr=http://localhost:9428/insert/jsonline
```
### Churn rate
It is possible to generate churn rate for active [log streams](https://docs.victoriametrics.com/victorialogs/keyconcepts/#stream-fields)
by specifying `-totalStreams` command-line flag bigger than `-activeStreams`. For example, the following command generates
logs for `1000` total streams, while the number of active streams equals to `100`. This means that at every time there are logs for `100` streams,
but these streams change over the given [`-start` ... `-end`] time range, so the total number of streams on the given time range becomes `1000`:
```
bin/vlogsgenerator \
-start=2024-01-01 -end=2024-02-01 \
-activeStreams=100 \
-totalStreams=1_000 \
-logsPerStream=10_000 \
-addr=http://localhost:9428/insert/jsonline
```
In this case the total number of generated logs equals to `-totalStreams` * `-logsPerStream` = `10_000_000`.
### Benchmark tuning
By default `vlogsgenerator` generates and writes logs by a single worker. This may limit the maximum data ingestion rate during benchmarks.
The number of workers can be changed via `-workers` command-line flag. For example, the following command generates and writes logs with `16` workers:
```
bin/vlogsgenerator \
-start=2024-01-01 -end=2024-02-01 \
-activeStreams=100 \
-logsPerStream=10_000 \
-addr=http://localhost:9428/insert/jsonline \
-workers=16
```
### Output statistics
Every 10 seconds `vlogsgenerator` writes statistics about the generated logs into `stderr`. The frequency of the generated statistics can be adjusted via `-statInterval` command-line flag.
For example, the following command writes statistics every 2 seconds:
addr=flag.String("addr","stdout","HTTP address to push the generated logs to; if it is set to stdout, then logs are generated to stdout")
workers=flag.Int("workers",1,"The number of workers to use to push logs to -addr")
start=newTimeFlag("start","-1d","Generated logs start from this time; see https://docs.victoriametrics.com/#timestamp-formats")
end=newTimeFlag("end","0s","Generated logs end at this time; see https://docs.victoriametrics.com/#timestamp-formats")
activeStreams=flag.Int("activeStreams",100,"The number of active log streams to generate; see https://docs.victoriametrics.com/victorialogs/keyconcepts/#stream-fields")
totalStreams=flag.Int("totalStreams",0,"The number of total log streams; if -totalStreams > -activeStreams, then some active streams are substituted with new streams "+
"during data generation")
logsPerStream=flag.Int64("logsPerStream",1_000,"The number of log entries to generate per each log stream. Log entries are evenly distributed between -start and -end")
constFieldsPerLog=flag.Int("constFieldsPerLog",3,"The number of fields with constaint values to generate per each log entry; "+
"see https://docs.victoriametrics.com/victorialogs/keyconcepts/#data-model")
varFieldsPerLog=flag.Int("varFieldsPerLog",1,"The number of fields with variable values to generate per each log entry; "+
"see https://docs.victoriametrics.com/victorialogs/keyconcepts/#data-model")
dictFieldsPerLog=flag.Int("dictFieldsPerLog",2,"The number of fields with up to 8 different values to generate per each log entry; "+
"see https://docs.victoriametrics.com/victorialogs/keyconcepts/#data-model")
u8FieldsPerLog=flag.Int("u8FieldsPerLog",1,"The number of fields with uint8 values to generate per each log entry; "+
"see https://docs.victoriametrics.com/victorialogs/keyconcepts/#data-model")
u16FieldsPerLog=flag.Int("u16FieldsPerLog",1,"The number of fields with uint16 values to generate per each log entry; "+
"see https://docs.victoriametrics.com/victorialogs/keyconcepts/#data-model")
u32FieldsPerLog=flag.Int("u32FieldsPerLog",1,"The number of fields with uint32 values to generate per each log entry; "+
"see https://docs.victoriametrics.com/victorialogs/keyconcepts/#data-model")
u64FieldsPerLog=flag.Int("u64FieldsPerLog",1,"The number of fields with uint64 values to generate per each log entry; "+
"see https://docs.victoriametrics.com/victorialogs/keyconcepts/#data-model")
floatFieldsPerLog=flag.Int("floatFieldsPerLog",1,"The number of fields with float64 values to generate per each log entry; "+
"see https://docs.victoriametrics.com/victorialogs/keyconcepts/#data-model")
ipFieldsPerLog=flag.Int("ipFieldsPerLog",1,"The number of fields with IPv4 values to generate per each log entry; "+
"see https://docs.victoriametrics.com/victorialogs/keyconcepts/#data-model")
timestampFieldsPerLog=flag.Int("timestampFieldsPerLog",1,"The number of fields with ISO8601 timestamps per each log entry; "+
"see https://docs.victoriametrics.com/victorialogs/keyconcepts/#data-model")
jsonFieldsPerLog=flag.Int("jsonFieldsPerLog",1,"The number of JSON fields to generate per each log entry; "+
"see https://docs.victoriametrics.com/victorialogs/keyconcepts/#data-model")
statInterval=flag.Duration("statInterval",10*time.Second,"The interval between publishing the stats")
)
funcmain(){
// Write flags and help message to stdout, since it is easier to grep or pipe.
logger.Infof("start -workers=%d workers for ingesting -logsPerStream=%d log entries per each -totalStreams=%d (-activeStreams=%d) on a time range -start=%s, -end=%s to -addr=%s",
httpserver.Errorf(w,r,"the query [%s] cannot be used in live tailing; see https://docs.victoriametrics.com/victorialogs/querying/#live-tailing for details",q)
maxConcurrentRequests=flag.Int("search.maxConcurrentRequests",getDefaultMaxConcurrentRequests(),"The maximum number of concurrent search requests. "+
"It shouldn't be high, since a single request can saturate all the CPU cores, while many concurrently executed requests may require high amounts of memory. "+
"See also -search.maxQueueDuration")
maxQueueDuration=flag.Duration("search.maxQueueDuration",10*time.Second,"The maximum time the search request waits for execution when -search.maxConcurrentRequests "+
"limit is reached; see also -search.maxQueryDuration")
maxQueryDuration=flag.Duration("search.maxQueryDuration",time.Second*30,"The maximum duration for query execution. It can be overridden on a per-query basis via 'timeout' query arg")
)
funcgetDefaultMaxConcurrentRequests()int{
n:=cgroup.AvailableCPUs()
ifn<=4{
n*=2
}
ifn>16{
// A single request can saturate all the CPU cores, so there is no sense
// in allowing higher number of concurrent requests - they will just contend
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.