100 Commits

Author SHA1 Message Date
Wouter Verlaek
a9810d6a0a
[ws-manager-mk2] Fix race where pod gets recreated in Stopped phase (#16622)
* [ws-manager-mk2] Fix race where pod gets recreated in Stopped phase

* [ws-manager-mk2] Add pod creation logs

* Change to Patch
2023-03-02 13:27:59 +01:00
Wouter Verlaek
cf0dd5571f
[ws-manager-mk2] Show start failures in dashboard, show daemon ctrl metrics (#16612) 2023-03-01 12:13:58 +01:00
Wouter Verlaek
d827a2b9dd
[ws-manager-mk2] Add queue depth and work duration panels (#16555) 2023-02-24 13:47:54 +01:00
Wouter Verlaek
733c37b2f8
[ws-manager-mk2] Import dashboard (#16532) 2023-02-23 15:12:53 +01:00
Wouter Verlaek
7440f00796
[ws-manager-mk2] Add Grafana dashboard (#16455)
* [ws-manager-mk2] Add Grafana dashboard

* [ws-manager-mk2] Add reconciliations by controller panel
2023-02-23 00:19:52 +01:00
Kyle Brennan
598b5372e8 [obs] Refactor alerts for image builds
For the last 30 days:

GitpodImagebuildDoneSuccess would have triggered once, on January 26 if set to 2h, instead of 4h.  A customer was potentially struggling with the outer loop. We hit a 60% error rate in Pyrra briefly: https://pyrra.gitpod.io/objectives?expr={__name__=%22workspace-imagebuild-buildsdone-success-ratio%22,%20namespace=%22monitoring-central%22,%20team=%22workspace%22}&grouping={}&from=1673297716785&to=1675716916785

GitpodImagebuildStartSuccess would have fired once, on Jan 8, when GCP was having scaling issues, and would have been correct to do so. https://gitpod.slack.com/archives/C01TNS8EVQT/p1673173223060219

Removed the warnings because they are unnecessary: Pyrra now sends them for SLOs to #team-workspace-alerts.
2023-02-16 14:51:21 +01:00
utam0k
33e6d1f540 obs: Make AutoscaleFailure go down to warning level 2023-01-20 06:20:27 +01:00
Wouter Verlaek
e3ce970423 [observability] Add image build rate panels 2023-01-09 17:00:48 +01:00
Kyle Brennan
f08784fbc8 [obs] fix image-builder-mk3 dashboard 2022-12-26 02:24:34 +01:00
Kyle Brennan
c01d43b809 [obs] move blobserve from Workspace to IDE 2022-12-26 02:22:34 +01:00
Pudong Zheng
fc6355a8d2 [observability] fix datasource in preview environment 2022-12-09 06:54:19 -03:00
Christian Weichel
478a75e744 Switch license to AGPL 2022-12-08 13:05:19 -03:00
Pavel Tumik @ GitPod
11b9774e3a [alerts] improve autoscale alert to provide actual reason for failure in alert message 2022-12-07 22:49:17 -03:00
Kyle Brennan
e845faae3c Update operations/observability/mixins/workspace/rules/central/nodes.yaml
Co-authored-by: Wouter Verlaek <wouter@gitpod.io>
2022-12-02 05:56:01 -03:00
Kyle Brennan
171ec14d53 Add alerts for image build success rate 2022-12-02 05:56:01 -03:00
Kyle Brennan
fc3586b5e2 Change GitpodWorkspaceStuckOnStarting to GitpodWorkspaceStuckOnStopped 2022-12-02 05:56:01 -03:00
Kyle Brennan
603446291f Reduce noise with GitpodWorkspaceNodeHighNormalizedLoadAverage now that we have network limiting and PSI, so it's more actionable 2022-12-02 05:56:01 -03:00
Thomas Schubart
6469258f28 Update StuckOnStopping alert 2022-11-21 14:19:51 -03:00
ArthurSens
c0c6b3a150 Fix syntax errors
Signed-off-by: ArthurSens <arthursens2005@gmail.com>
2022-11-09 17:58:39 +02:00
Kyle Brennan
6ff821261d Reduce noise for GitpodWorkspaceNodeHighNormalizedLoadAverage 2022-11-07 23:10:37 +01:00
ArthurSens
ebed98a31c Split workspace alerts into central and satellite
Signed-off-by: ArthurSens <arthursens2005@gmail.com>
2022-11-07 23:10:37 +01:00
ArthurSens
88bfdb998a Prepare workspace alerts to centralized alerting
Signed-off-by: ArthurSens <arthursens2005@gmail.com>
2022-11-07 23:10:37 +01:00
JenTing Hsiao
7e2f4b166d obs: fix new workspace does not show workspace success rate
Signed-off-by: JenTing Hsiao <hsiaoairplane@gmail.com>
2022-11-04 04:47:08 +01:00
Manuel Alejandro de Brito Fontes
d01040cb19 Fix registry facade blobSource dashboard and add link 2022-11-04 00:07:08 +01:00
Thomas Schubart
f31bbd2ca9 [obs] Rename psi dashboard to node psi 2022-11-01 03:40:06 +01:00
Thomas Schubart
fb62393f1f [obs] Create workspace psi dashboard 2022-11-01 03:40:06 +01:00
Manuel Alejandro de Brito Fontes
52848f6e18 [registry-facade] Add new blobSource dashboard 2022-10-25 23:35:40 +02:00
JenTing Hsiao
b9c841f2f5 obs: fix new workspace does not show workspace success rate
Signed-off-by: JenTing Hsiao <hsiaoairplane@gmail.com>
2022-10-13 22:34:29 +02:00
Thomas Schubart
f668ab404f [obs] Remove NetworkConnectionsTooHigh alert 2022-10-11 04:08:26 +02:00
Thomas Schubart
f4a71fa6cc [obs] Dashboard for psi metrics 2022-10-04 00:34:19 +02:00
Thomas Schubart
e99d002c5a Revert "fix network connections alert to fire only for workspace pods"
This reverts commit 83d4edba28efbe99a0c00d1d26e747b3824ee3c7.
2022-09-29 11:34:29 +02:00
Pavel Tumik @ GitPod
83d4edba28 fix network connections alert to fire only for workspace pods 2022-09-29 03:49:29 +02:00
Pavel Tumik @ GitPod
ea8fbdc4dd fix prometheus rule for workspaces 2022-09-21 22:16:22 +02:00
Thomas Schubart
f61eacf1e8 [obs] Fix dashboard import error 2022-09-20 22:46:21 +02:00
Milan Pavlik
30cffea01a [image-builder] Move dashboard to team Workspace 2022-09-19 20:24:20 +02:00
Thomas Schubart
bf5917f631 [obs] Add network limiting overview panel 2022-09-19 20:22:20 +02:00
Thomas Schubart
243ee21379 [obs] Fix display of network limiting stats
- Ensure data source is selected
- Use network limiting stats for sourcing workspace and node
2022-09-16 01:00:16 +02:00
Thomas Schubart
fc2b4422c6 Import network limiting dashboard 2022-09-09 13:59:24 +02:00
ArthurSens
9b382b6f69 Fix PrometheusRule name
Signed-off-by: ArthurSens <arthursens2005@gmail.com>
2022-09-08 20:31:23 +02:00
ArthurSens
7c354c9a38 Replace workspace alerts from jsonnet to YAML
Signed-off-by: ArthurSens <arthursens2005@gmail.com>
2022-09-08 15:46:23 +02:00
Thomas Schubart
ccb148f2a6 [observability] Add dashboard for network limiting 2022-08-31 10:27:15 +02:00
Pavel Tumik @ GitPod
cc79d75a96 [alerts] increase GitpodWorkspaceStuckOnStopping for time to 30min to reduce flakiness 2022-08-23 20:32:39 +02:00
JenTing Hsiao
d5462c0d02 observability: add #workspace > 20 in alert GitpodWorkspaceTooManyRegularNotActive
This prevents the alert from being triggered once we start traffic shifting.
When the number of workspaces is low, the
gitpod_workspace_regular_not_active_percentage threshold is easy to hit
because gitpod_ws_manager_workspace_activity_total is a small number.

Therefore, we add #workspace > 20 as another criterion for the alert.

Signed-off-by: JenTing Hsiao <hsiaoairplane@gmail.com>
2022-08-11 16:35:28 +02:00
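
The guard described in the commit above (d5462c0d02) can be illustrated with a small sketch. It is not the actual Prometheus rule: the function name, the `pct_threshold` default of 0.1, and the parameter names are hypothetical, chosen only to show why a minimum workspace count keeps a percentage-based alert from firing on small samples. Only the "more than 20 workspaces" criterion comes from the commit message.

```python
def should_fire(not_active: int, total: int,
                pct_threshold: float = 0.1,   # hypothetical threshold, not from the source
                min_workspaces: int = 20) -> bool:
    """Return True if a 'too many regular workspaces not active' alert would fire.

    The alert needs both conditions:
      1. more than `min_workspaces` workspaces exist (the "#workspace > 20" guard), and
      2. the share of not-active workspaces exceeds `pct_threshold`.
    """
    if total <= min_workspaces:
        # Too few workspaces: a handful of inactive ones would dominate the ratio.
        return False
    return not_active / total > pct_threshold


# During traffic shifting a cluster may only have a few workspaces.
print(should_fire(not_active=2, total=5))     # guard suppresses the alert
print(should_fire(not_active=30, total=100))  # enough workspaces, ratio is high
```

Without the guard (`min_workspaces=0`), the first call would fire even though only two workspaces are affected, which is the noise the commit aims to remove.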
utam0k
2d1f66ae25 observability: Add an alert for the network connections. 2022-08-10 05:55:54 +02:00
Pavel Tumik
06a686acf1 [alerts] change load avg alert to critical 2022-08-05 16:11:49 -03:00
Manuel Alejandro de Brito Fontes
a5dd648f06 Add dashboard for node problem detector 2022-07-26 16:05:21 -03:00
Manuel Alejandro de Brito Fontes
70eaa01676 Add dashboard for ephemeral storage 2022-07-26 15:24:21 -03:00
Arthur Silva Sens
cd28f4c34d Route GitpodWorkspaceStuckOnStarting to #t_workspace_alerts 2022-07-26 14:15:21 -03:00
Manuel Alejandro de Brito Fontes
18c764cbac Add dashboard for swap utilization per cluster and node 2022-07-19 19:40:14 -03:00
Pudong Zheng
25c5bfbecb [alerts] change alert for adding new nodes rapidly to only count if node type is regular workspace 2022-07-18 03:40:13 +02:00