95 Commits

Author SHA1 Message Date
Kyle Brennan
598b5372e8 [obs] Refactor alerts for image builds
For the last 30 days:

GitpodImagebuildDoneSuccess would have triggered once, on January 26, if set to 2h instead of 4h. A customer was potentially struggling with the outer loop. We briefly hit a 60% error rate in Pyrra: https://pyrra.gitpod.io/objectives?expr={__name__=%22workspace-imagebuild-buildsdone-success-ratio%22,%20namespace=%22monitoring-central%22,%20team=%22workspace%22}&grouping={}&from=1673297716785&to=1675716916785

GitpodImagebuildStartSuccess would have fired once, on Jan 8, when GCP was having scaling issues, and would have been correct to do so. https://gitpod.slack.com/archives/C01TNS8EVQT/p1673173223060219

Removed the warnings because they're unnecessary. Why? Pyrra sends them now for SLOs to #team-workspace-alerts.
2023-02-16 14:51:21 +01:00
utam0k
33e6d1f540 obs: Make AutoscaleFailure go down to warning level 2023-01-20 06:20:27 +01:00
Wouter Verlaek
e3ce970423 [observability] Add image build rate panels 2023-01-09 17:00:48 +01:00
Kyle Brennan
f08784fbc8 [obs] fix image-builder-mk3 dashboard 2022-12-26 02:24:34 +01:00
Kyle Brennan
c01d43b809 [obs] move blobserve from Workspace to IDE 2022-12-26 02:22:34 +01:00
Pudong Zheng
fc6355a8d2 [observability] fix datasource in preview environment 2022-12-09 06:54:19 -03:00
Christian Weichel
478a75e744 Switch license to AGPL 2022-12-08 13:05:19 -03:00
Pavel Tumik @ GitPod
11b9774e3a [alerts] improve autoscale alert to provide actual reason for failure in alert message 2022-12-07 22:49:17 -03:00
Kyle Brennan
e845faae3c Update operations/observability/mixins/workspace/rules/central/nodes.yaml
Co-authored-by: Wouter Verlaek <wouter@gitpod.io>
2022-12-02 05:56:01 -03:00
Kyle Brennan
171ec14d53 Add alerts for image build success rate 2022-12-02 05:56:01 -03:00
Kyle Brennan
fc3586b5e2 Change GitpodWorkspaceStuckOnStarting to GitpodWorkspaceStuckOnStopped 2022-12-02 05:56:01 -03:00
Kyle Brennan
603446291f Reduce noise with GitpodWorkspaceNodeHighNormalizedLoadAverage now that we have network limiting and PSI so it's more actionable 2022-12-02 05:56:01 -03:00
Thomas Schubart
6469258f28 Update StuckOnStopping alert 2022-11-21 14:19:51 -03:00
ArthurSens
c0c6b3a150 Fix syntax errors
Signed-off-by: ArthurSens <arthursens2005@gmail.com>
2022-11-09 17:58:39 +02:00
Kyle Brennan
6ff821261d Reduce noise for GitpodWorkspaceNodeHighNormalizedLoadAverage 2022-11-07 23:10:37 +01:00
ArthurSens
ebed98a31c Split workspace alerts into central and satellite
Signed-off-by: ArthurSens <arthursens2005@gmail.com>
2022-11-07 23:10:37 +01:00
ArthurSens
88bfdb998a Prepare workspace alerts to centralized alerting
Signed-off-by: ArthurSens <arthursens2005@gmail.com>
2022-11-07 23:10:37 +01:00
JenTing Hsiao
7e2f4b166d obs: fix new workspace not showing workspace success rate
Signed-off-by: JenTing Hsiao <hsiaoairplane@gmail.com>
2022-11-04 04:47:08 +01:00
Manuel Alejandro de Brito Fontes
d01040cb19 Fix registry facade blobSource dashboard and add link 2022-11-04 00:07:08 +01:00
Thomas Schubart
f31bbd2ca9 [obs] Rename psi dashboard to node psi 2022-11-01 03:40:06 +01:00
Thomas Schubart
fb62393f1f [obs] Create workspace psi dashboard 2022-11-01 03:40:06 +01:00
Manuel Alejandro de Brito Fontes
52848f6e18 [registry-facade] Add new blobSource dashboard 2022-10-25 23:35:40 +02:00
JenTing Hsiao
b9c841f2f5 obs: fix new workspace not showing workspace success rate
Signed-off-by: JenTing Hsiao <hsiaoairplane@gmail.com>
2022-10-13 22:34:29 +02:00
Thomas Schubart
f668ab404f [obs] Remove NetworkConnectionsTooHigh alert 2022-10-11 04:08:26 +02:00
Thomas Schubart
f4a71fa6cc [obs] Dashboard for psi metrics 2022-10-04 00:34:19 +02:00
Thomas Schubart
e99d002c5a Revert "fix network connections alert to fire only for workspace pods"
This reverts commit 83d4edba28efbe99a0c00d1d26e747b3824ee3c7.
2022-09-29 11:34:29 +02:00
Pavel Tumik @ GitPod
83d4edba28 fix network connections alert to fire only for workspace pods 2022-09-29 03:49:29 +02:00
Pavel Tumik @ GitPod
ea8fbdc4dd fix prometheus rule for workspaces 2022-09-21 22:16:22 +02:00
Thomas Schubart
f61eacf1e8 [obs] Fix dashboard import error 2022-09-20 22:46:21 +02:00
Milan Pavlik
30cffea01a [image-builder] Move dashboard to team Workspace 2022-09-19 20:24:20 +02:00
Thomas Schubart
bf5917f631 [obs] Add network limiting overview panel 2022-09-19 20:22:20 +02:00
Thomas Schubart
243ee21379 [obs] Fix display of network limiting stats
- Ensure data source is selected
- Use network limiting stats for sourcing workspace and node
2022-09-16 01:00:16 +02:00
Thomas Schubart
fc2b4422c6 Import network limiting dashboard 2022-09-09 13:59:24 +02:00
ArthurSens
9b382b6f69 Fix PrometheusRule name
Signed-off-by: ArthurSens <arthursens2005@gmail.com>
2022-09-08 20:31:23 +02:00
ArthurSens
7c354c9a38 Replace workspace alerts from jsonnet to YAML
Signed-off-by: ArthurSens <arthursens2005@gmail.com>
2022-09-08 15:46:23 +02:00
Thomas Schubart
ccb148f2a6 [observability] Add dashboard for network limiting 2022-08-31 10:27:15 +02:00
Pavel Tumik @ GitPod
cc79d75a96 [alerts] increase GitpodWorkspaceStuckOnStopping for time to 30min to reduce flakiness 2022-08-23 20:32:39 +02:00
JenTing Hsiao
d5462c0d02 observability: add #workspace > 20 in alert GitpodWorkspaceTooManyRegularNotActive
This prevents the alert from being triggered once we start traffic shifting.
The number of workspaces might be low at that point, which makes the
gitpod_workspace_regular_not_active_percentage threshold easy to hit because
gitpod_ws_manager_workspace_activity_total is a low number.

Therefore, we add #workspace > 20 as another criterion for the alert.

Signed-off-by: JenTing Hsiao <hsiaoairplane@gmail.com>
2022-08-11 16:35:28 +02:00
utam0k
2d1f66ae25 observability: Add an alert for the network connections. 2022-08-10 05:55:54 +02:00
Pavel Tumik
06a686acf1 [alerts] change load avg alert to critical 2022-08-05 16:11:49 -03:00
Manuel Alejandro de Brito Fontes
a5dd648f06 Add dashboard for node problem detector 2022-07-26 16:05:21 -03:00
Manuel Alejandro de Brito Fontes
70eaa01676 Add dashboard for ephemeral storage 2022-07-26 15:24:21 -03:00
Arthur Silva Sens
cd28f4c34d Route GitpodWorkspaceStuckOnStarting to #t_workspace_alerts 2022-07-26 14:15:21 -03:00
Manuel Alejandro de Brito Fontes
18c764cbac Add dashboard for swap utilization per cluster and node 2022-07-19 19:40:14 -03:00
Pudong Zheng
25c5bfbecb [alerts] change alert for adding new nodes rapidly to only count if node type is regular workspace 2022-07-18 03:40:13 +02:00
Nandaja Varma
d13bdfd0cd Improve GitpodWorkspaceTooManyRegularNotActive alert 2022-07-11 06:09:58 +05:30
utam0k
04d945d216 observability: Add an alert for AutoscaleFailure. 2022-07-06 00:36:52 +05:30
JenTing Hsiao
7800a21c4d [alerts] fix pod/container/namespace not rendering
Every time series is uniquely identified by its metric name and
a set of labels, and every unique combination of key-value label pairs
represents a new alert for this time series.

There is no common value for these metrics
- kube_pod_container_status_restarts_total
- gitpod_ws_manager_workspace_backups_failure_total

Signed-off-by: JenTing Hsiao <hsiaoairplane@gmail.com>
2022-07-01 06:23:39 +05:30
utam0k
6c2705fbe4 observability: Ring the phone only when a data loss occurs with GitpodWsDaemonCrashLooping 2022-06-23 19:06:32 +05:30
Pavel Tumik
cf35903aff Apply suggestions from code review 2022-06-23 02:46:31 +05:30