53 Commits

Author SHA1 Message Date
Kyle Brennan
598b5372e8 [obs] Refactor alerts for image builds
For the last 30 days:

GitpodImagebuildDoneSuccess would have triggered once, on January 26, if set to 2h instead of 4h. A customer was potentially struggling with the outer loop. We briefly hit a 60% error rate in Pyrra: https://pyrra.gitpod.io/objectives?expr={__name__=%22workspace-imagebuild-buildsdone-success-ratio%22,%20namespace=%22monitoring-central%22,%20team=%22workspace%22}&grouping={}&from=1673297716785&to=1675716916785

GitpodImagebuildStartSuccess would have fired once, on Jan 8, when GCP was having scaling issues, and would have been correct to do so. https://gitpod.slack.com/archives/C01TNS8EVQT/p1673173223060219

Removed the warnings because they're unnecessary: Pyrra now sends them for SLOs to #team-workspace-alerts.
2023-02-16 14:51:21 +01:00
utam0k
33e6d1f540 obs: Make AutoscaleFailure go down to warning level 2023-01-20 06:20:27 +01:00
Christian Weichel
478a75e744 Switch license to AGPL 2022-12-08 13:05:19 -03:00
Pavel Tumik @ GitPod
11b9774e3a [alerts] improve autoscale alert to provide actual reason for failure in alert message 2022-12-07 22:49:17 -03:00
Kyle Brennan
e845faae3c Update operations/observability/mixins/workspace/rules/central/nodes.yaml
Co-authored-by: Wouter Verlaek <wouter@gitpod.io>
2022-12-02 05:56:01 -03:00
Kyle Brennan
171ec14d53 Add alerts for image build success rate 2022-12-02 05:56:01 -03:00
Kyle Brennan
fc3586b5e2 Change GitpodWorkspaceStuckOnStarting to GitpodWorkspaceStuckOnStopped 2022-12-02 05:56:01 -03:00
Kyle Brennan
603446291f Reduce noise with GitpodWorkspaceNodeHighNormalizedLoadAverage now that we have network limiting and PSI so it's more actionable 2022-12-02 05:56:01 -03:00
Thomas Schubart
6469258f28 Update StuckOnStopping alert 2022-11-21 14:19:51 -03:00
ArthurSens
c0c6b3a150 Fix syntax errors
Signed-off-by: ArthurSens <arthursens2005@gmail.com>
2022-11-09 17:58:39 +02:00
Kyle Brennan
6ff821261d Reduce noise for GitpodWorkspaceNodeHighNormalizedLoadAverage 2022-11-07 23:10:37 +01:00
ArthurSens
ebed98a31c Split workspace alerts into central and satellite
Signed-off-by: ArthurSens <arthursens2005@gmail.com>
2022-11-07 23:10:37 +01:00
ArthurSens
88bfdb998a Prepare workspace alerts to centralized alerting
Signed-off-by: ArthurSens <arthursens2005@gmail.com>
2022-11-07 23:10:37 +01:00
Thomas Schubart
f668ab404f [obs] Remove NetworkConnectionsTooHigh alert 2022-10-11 04:08:26 +02:00
Thomas Schubart
e99d002c5a Revert "fix network connections alert to fire only for workspace pods"
This reverts commit 83d4edba28efbe99a0c00d1d26e747b3824ee3c7.
2022-09-29 11:34:29 +02:00
Pavel Tumik @ GitPod
83d4edba28 fix network connections alert to fire only for workspace pods 2022-09-29 03:49:29 +02:00
Pavel Tumik @ GitPod
ea8fbdc4dd fix prometheus rule for workspaces 2022-09-21 22:16:22 +02:00
ArthurSens
9b382b6f69 Fix PrometheusRule name
Signed-off-by: ArthurSens <arthursens2005@gmail.com>
2022-09-08 20:31:23 +02:00
ArthurSens
7c354c9a38 Replace workspace alerts from jsonnet to YAML
Signed-off-by: ArthurSens <arthursens2005@gmail.com>
2022-09-08 15:46:23 +02:00
Pavel Tumik @ GitPod
cc79d75a96 [alerts] increase GitpodWorkspaceStuckOnStopping for time to 30min to reduce flakiness 2022-08-23 20:32:39 +02:00
JenTing Hsiao
d5462c0d02 observability: add #workspace > 20 in alert GitpodWorkspaceTooManyRegularNotActive
This prevents the alert from being triggered once we start traffic shifting.
While the number of workspaces is low, the
gitpod_workspace_regular_not_active_percentage threshold is easy to hit because
gitpod_ws_manager_workspace_activity_total is a small number.

Therefore, we add #workspace > 20 as another criterion for the alert.

Signed-off-by: JenTing Hsiao <hsiaoairplane@gmail.com>
2022-08-11 16:35:28 +02:00
utam0k
2d1f66ae25 observability: Add an alert for the network connections. 2022-08-10 05:55:54 +02:00
Pavel Tumik
06a686acf1 [alerts] change load avg alert to critical 2022-08-05 16:11:49 -03:00
Arthur Silva Sens
cd28f4c34d Route GitpodWorkspaceStuckOnStarting to #t_workspace_alerts 2022-07-26 14:15:21 -03:00
Pudong Zheng
25c5bfbecb [alerts] change alert for adding new nodes rapidly to only count if node type is regular workspace 2022-07-18 03:40:13 +02:00
Nandaja Varma
d13bdfd0cd Improve GitpodWorkspaceTooManyRegularNotActive alert 2022-07-11 06:09:58 +05:30
utam0k
04d945d216 observability: Add an alert for AutoscaleFailure. 2022-07-06 00:36:52 +05:30
JenTing Hsiao
7800a21c4d [alerts] fix pod/container/namespace not rendering
Every time series is uniquely identified by its metric name and
a set of labels, and every unique combination of key-value label pairs
represents a new alert for this time series.

There is no common value for these metrics
- kube_pod_container_status_restarts_total
- gitpod_ws_manager_workspace_backups_failure_total

Signed-off-by: JenTing Hsiao <hsiaoairplane@gmail.com>
2022-07-01 06:23:39 +05:30
utam0k
6c2705fbe4 observability: Ring the phone only when a data loss occurs with GitpodWsDaemonCrashLoopingg 2022-06-23 19:06:32 +05:30
Pavel Tumik
cf35903aff Apply suggestions from code review 2022-06-23 02:46:31 +05:30
Pavel Tumik
7e0fe457fb Apply suggestions from code review 2022-06-23 02:46:31 +05:30
utam0k
62859996d5 observability: Add GitpodWorkspaceTooLongTerminating alert. 2022-06-23 02:46:31 +05:30
Jan Keromnes
1858e5d61b Remove critical alert GitpodWsDaemonExcessiveGC > 60s (but keep the non-critical warning for now) 2022-06-16 20:56:26 +05:30
Moritz Eysholdt
65beac91df Fix runbook URL 2022-06-03 22:59:52 +05:30
Pavel Tumik
ad8d971176 [alerts] add alert when autoscaler adds nodes rapidly 2022-05-19 12:55:34 +05:30
Prince Rachit Sinha
5045e85f2a [observability] Add alerts for pending phase 2022-05-02 16:39:18 +05:30
Prince Rachit Sinha
de0c0e80a4 [observability] Add GitpodWorkspacesNotStarting alert 2022-04-26 08:53:38 +05:30
Prince Rachit Sinha
64fbd1e841 [observability] Add alert rule for high ws failure 2022-04-18 22:09:31 +05:30
Prince Rachit Sinha
aea24d85f8 Update runbook url for GitpodWsDaemonExcessiveGC 2022-03-01 21:19:08 +05:30
Pavel Tumik
ebb2a33667 change alert period for ws stuck on starting or stopping to 20m 2022-03-01 06:22:08 +05:30
Manuel Alejandro de Brito Fontes
90fe82a508 Remove ghost from the codebase 2022-02-28 14:17:07 +05:30
Prince Rachit Sinha
a48e177120 Add alerts for excessive GC of ws-daemon 2022-02-14 11:25:35 +01:00
Prince Rachit Sinha
95592d00d8 Update run book ref for GitpodWorkspaceTooManyRegularNotActive 2022-02-09 20:04:31 +01:00
Prince Rachit Sinha
2a3e4d60f3 Update GitpodWorkspaceTooManyRegularNotActive severity level 2022-02-09 20:04:31 +01:00
Kyle Brennan
71f543110f Trigger node high load warnings sooner 2022-01-21 22:24:14 +01:00
Thomas Schubart
c62ec6633b Trigger node high load warnings sooner 2022-01-21 17:41:13 +01:00
ArthurSens
1f6195853b Add alert for normalized Load average higher than 10.
The same recording rule is also added to Gitpod / Overview dashboard, replacing the noisy neighborhood panel

Signed-off-by: ArthurSens <arthursens2005@gmail.com>
2021-11-15 08:36:12 +01:00
ArthurSens
f173d2adcd [observability] Fix jsonnet format check
Signed-off-by: ArthurSens <arthursens2005@gmail.com>
2021-10-21 16:56:59 +02:00
ArthurSens
560a34a2cf Update runbooks' URL
Signed-off-by: ArthurSens <arthursens2005@gmail.com>
2021-10-19 12:37:06 -03:00
Christian Weichel
ddc37ce439 [observability] Add SLO for "regular not active" 2021-10-15 10:38:02 -03:00