Kyle Brennan
598b5372e8
[obs] Refactor alerts for image builds
...
For the last 30 days:
GitpodImagebuildDoneSuccess would have triggered once, on January 26, if set to 2h instead of 4h. A customer was potentially struggling with the outer loop. We briefly hit a 60% error rate in Pyrra: https://pyrra.gitpod.io/objectives?expr={__name__=%22workspace-imagebuild-buildsdone-success-ratio%22,%20namespace=%22monitoring-central%22,%20team=%22workspace%22}&grouping={}&from=1673297716785&to=1675716916785
GitpodImagebuildStartSuccess would have fired once, on Jan 8, when GCP was having scaling issues, and would have been correct to do so. https://gitpod.slack.com/archives/C01TNS8EVQT/p1673173223060219
Removed the warnings because they're unnecessary. Why? Pyrra sends them now for SLOs to #team-workspace-alerts.
2023-02-16 14:51:21 +01:00
utam0k
33e6d1f540
obs: Make AutoscaleFailure go down to warning level
2023-01-20 06:20:27 +01:00
Christian Weichel
478a75e744
Switch license to AGPL
2022-12-08 13:05:19 -03:00
Pavel Tumik @ GitPod
11b9774e3a
[alerts] improve autoscale alert to provide actual reason for failure in alert message
2022-12-07 22:49:17 -03:00
Kyle Brennan
e845faae3c
Update operations/observability/mixins/workspace/rules/central/nodes.yaml
...
Co-authored-by: Wouter Verlaek <wouter@gitpod.io>
2022-12-02 05:56:01 -03:00
Kyle Brennan
171ec14d53
Add alerts for image build success rate
2022-12-02 05:56:01 -03:00
Kyle Brennan
fc3586b5e2
Change GitpodWorkspaceStuckOnStarting to GitpodWorkspaceStuckOnStopped
2022-12-02 05:56:01 -03:00
Kyle Brennan
603446291f
Reduce noise with GitpodWorkspaceNodeHighNormalizedLoadAverage now that we have network limiting and PSI, so it's more actionable
2022-12-02 05:56:01 -03:00
Thomas Schubart
6469258f28
Update StuckOnStopping alert
2022-11-21 14:19:51 -03:00
ArthurSens
c0c6b3a150
Fix syntax errors
...
Signed-off-by: ArthurSens <arthursens2005@gmail.com>
2022-11-09 17:58:39 +02:00
Kyle Brennan
6ff821261d
Reduce noise for GitpodWorkspaceNodeHighNormalizedLoadAverage
2022-11-07 23:10:37 +01:00
ArthurSens
ebed98a31c
Split workspace alerts into central and satellite
...
Signed-off-by: ArthurSens <arthursens2005@gmail.com>
2022-11-07 23:10:37 +01:00
ArthurSens
88bfdb998a
Prepare workspace alerts to centralized alerting
...
Signed-off-by: ArthurSens <arthursens2005@gmail.com>
2022-11-07 23:10:37 +01:00
Thomas Schubart
f668ab404f
[obs] Remove NetworkConnectionsTooHigh alert
2022-10-11 04:08:26 +02:00
Thomas Schubart
e99d002c5a
Revert "fix network connections alert to fire only for workspace pods"
...
This reverts commit 83d4edba28efbe99a0c00d1d26e747b3824ee3c7.
2022-09-29 11:34:29 +02:00
Pavel Tumik @ GitPod
83d4edba28
fix network connections alert to fire only for workspace pods
2022-09-29 03:49:29 +02:00
Pavel Tumik @ GitPod
ea8fbdc4dd
fix prometheus rule for workspaces
2022-09-21 22:16:22 +02:00
ArthurSens
9b382b6f69
Fix PrometheusRule name
...
Signed-off-by: ArthurSens <arthursens2005@gmail.com>
2022-09-08 20:31:23 +02:00
ArthurSens
7c354c9a38
Replace workspace alerts from jsonnet to YAML
...
Signed-off-by: ArthurSens <arthursens2005@gmail.com>
2022-09-08 15:46:23 +02:00
Pavel Tumik @ GitPod
cc79d75a96
[alerts] increase GitpodWorkspaceStuckOnStopping for time to 30min to reduce flakiness
2022-08-23 20:32:39 +02:00
JenTing Hsiao
d5462c0d02
observability: add #workspace > 20 in alert GitpodWorkspaceTooManyRegularNotActive
...
This prevents the alert from being triggered once we start traffic shifting.
When the number of workspaces is low, the
gitpod_workspace_regular_not_active_percentage threshold is easy to hit because
gitpod_ws_manager_workspace_activity_total is a small number.
Therefore, we add #workspace > 20 as another criterion for the alert.
Signed-off-by: JenTing Hsiao <hsiaoairplane@gmail.com>
2022-08-11 16:35:28 +02:00
utam0k
2d1f66ae25
observability: Add an alert for the network connections.
2022-08-10 05:55:54 +02:00
Pavel Tumik
06a686acf1
[alerts] change load avg alert to critical
2022-08-05 16:11:49 -03:00
Arthur Silva Sens
cd28f4c34d
Route GitpodWorkspaceStuckOnStarting to #t_workspace_alerts
2022-07-26 14:15:21 -03:00
Pudong Zheng
25c5bfbecb
[alerts] change alert for adding new nodes rapidly to only count if node type is regular workspace
2022-07-18 03:40:13 +02:00
Nandaja Varma
d13bdfd0cd
Improve GitpodWorkspaceTooManyRegularNotActive alert
2022-07-11 06:09:58 +05:30
utam0k
04d945d216
observability: Add an alert for AutoscaleFailure.
2022-07-06 00:36:52 +05:30
JenTing Hsiao
7800a21c4d
[alerts] fix pod/container/namespace not rendering
...
Every time series is uniquely identified by its metric name and
a set of labels, and every unique combination of key-value label pairs
represents a new alert for this time series.
There is no common value for these metrics:
- kube_pod_container_status_restarts_total
- gitpod_ws_manager_workspace_backups_failure_total
Signed-off-by: JenTing Hsiao <hsiaoairplane@gmail.com>
2022-07-01 06:23:39 +05:30
utam0k
6c2705fbe4
observability: Ring the phone only when a data loss occurs with GitpodWsDaemonCrashLoopingg
2022-06-23 19:06:32 +05:30
Pavel Tumik
cf35903aff
Apply suggestions from code review
2022-06-23 02:46:31 +05:30
Pavel Tumik
7e0fe457fb
Apply suggestions from code review
2022-06-23 02:46:31 +05:30
utam0k
62859996d5
observability: Add GitpodWorkspaceTooLongTerminating alert.
2022-06-23 02:46:31 +05:30
Jan Keromnes
1858e5d61b
Remove critical alert GitpodWsDaemonExcessiveGC > 60s (but keep the non-critical warning for now)
2022-06-16 20:56:26 +05:30
Moritz Eysholdt
65beac91df
Fix runbook URL
2022-06-03 22:59:52 +05:30
Pavel Tumik
ad8d971176
[alerts] add alert when autoscaler adds nodes rapidly
2022-05-19 12:55:34 +05:30
Prince Rachit Sinha
5045e85f2a
[observability] Add alerts for pending phase
2022-05-02 16:39:18 +05:30
Prince Rachit Sinha
de0c0e80a4
[observability] Add GitpodWorkspacesNotStarting alert
2022-04-26 08:53:38 +05:30
Prince Rachit Sinha
64fbd1e841
[observability] Add alert rule for high ws failure
2022-04-18 22:09:31 +05:30
Prince Rachit Sinha
aea24d85f8
Update runbook url for GitpodWsDaemonExcessiveGC
2022-03-01 21:19:08 +05:30
Pavel Tumik
ebb2a33667
change alert period for ws stuck on starting or stopping to 20m
2022-03-01 06:22:08 +05:30
Manuel Alejandro de Brito Fontes
90fe82a508
Remove ghost from the codebase
2022-02-28 14:17:07 +05:30
Prince Rachit Sinha
a48e177120
Add alerts for excessive GC of ws-daemon
2022-02-14 11:25:35 +01:00
Prince Rachit Sinha
95592d00d8
Update run book ref for GitpodWorkspaceTooManyRegularNotActive
2022-02-09 20:04:31 +01:00
Prince Rachit Sinha
2a3e4d60f3
Update GitpodWorkspaceTooManyRegularNotActive severity level
2022-02-09 20:04:31 +01:00
Kyle Brennan
71f543110f
Trigger node high load warnings sooner
2022-01-21 22:24:14 +01:00
Thomas Schubart
c62ec6633b
Trigger node high load warnings sooner
2022-01-21 17:41:13 +01:00
ArthurSens
1f6195853b
Add alert for normalized Load average higher than 10.
...
The same recording rule is also added to Gitpod / Overview dashboard, replacing the noisy neighborhood panel
Signed-off-by: ArthurSens <arthursens2005@gmail.com>
2021-11-15 08:36:12 +01:00
ArthurSens
f173d2adcd
[observability] Fix jsonnet format check
...
Signed-off-by: ArthurSens <arthursens2005@gmail.com>
2021-10-21 16:56:59 +02:00
ArthurSens
560a34a2cf
Update runbooks' URL
...
Signed-off-by: ArthurSens <arthursens2005@gmail.com>
2021-10-19 12:37:06 -03:00
Christian Weichel
ddc37ce439
[observability] Add SLO for "regular not active"
2021-10-15 10:38:02 -03:00