Kyle Brennan
598b5372e8
[obs] Refactor alerts for image builds
...
For the last 30 days:
GitpodImagebuildDoneSuccess would have triggered once, on January 26, if set to 2h instead of 4h. A customer was potentially struggling with the outer loop. We briefly hit a 60% error rate in Pyrra: https://pyrra.gitpod.io/objectives?expr={__name__=%22workspace-imagebuild-buildsdone-success-ratio%22,%20namespace=%22monitoring-central%22,%20team=%22workspace%22}&grouping={}&from=1673297716785&to=1675716916785
GitpodImagebuildStartSuccess would have fired once, on Jan 8, when GCP was having scaling issues, and would have been correct to do so. https://gitpod.slack.com/archives/C01TNS8EVQT/p1673173223060219
Removed the warnings because they're unnecessary: Pyrra now sends them for SLOs to #team-workspace-alerts.
2023-02-16 14:51:21 +01:00
utam0k
33e6d1f540
obs: Make AutoscaleFailure go down to warning level
2023-01-20 06:20:27 +01:00
Wouter Verlaek
e3ce970423
[observability] Add image build rate panels
2023-01-09 17:00:48 +01:00
Kyle Brennan
f08784fbc8
[obs] fix image-builder-mk3 dashboard
2022-12-26 02:24:34 +01:00
Kyle Brennan
c01d43b809
[obs] move blobserve from Workspace to IDE
2022-12-26 02:22:34 +01:00
Pudong Zheng
fc6355a8d2
[observability] fix datasource in preview environment
2022-12-09 06:54:19 -03:00
Christian Weichel
478a75e744
Switch license to AGPL
2022-12-08 13:05:19 -03:00
Pavel Tumik @ GitPod
11b9774e3a
[alerts] improve autoscale alert to provide actual reason for failure in alert message
2022-12-07 22:49:17 -03:00
Kyle Brennan
e845faae3c
Update operations/observability/mixins/workspace/rules/central/nodes.yaml
...
Co-authored-by: Wouter Verlaek <wouter@gitpod.io>
2022-12-02 05:56:01 -03:00
Kyle Brennan
171ec14d53
Add alerts for image build success rate
2022-12-02 05:56:01 -03:00
Kyle Brennan
fc3586b5e2
Change GitpodWorkspaceStuckOnStarting to GitpodWorkspaceStuckOnStopped
2022-12-02 05:56:01 -03:00
Kyle Brennan
603446291f
Reduce noise with GitpodWorkspaceNodeHighNormalizedLoadAverage now that we have network limiting and PSI, so it's more actionable
2022-12-02 05:56:01 -03:00
Thomas Schubart
6469258f28
Update StuckOnStopping alert
2022-11-21 14:19:51 -03:00
ArthurSens
c0c6b3a150
Fix syntax errors
...
Signed-off-by: ArthurSens <arthursens2005@gmail.com>
2022-11-09 17:58:39 +02:00
Kyle Brennan
6ff821261d
Reduce noise for GitpodWorkspaceNodeHighNormalizedLoadAverage
2022-11-07 23:10:37 +01:00
ArthurSens
ebed98a31c
Split workspace alerts into central and satellite
...
Signed-off-by: ArthurSens <arthursens2005@gmail.com>
2022-11-07 23:10:37 +01:00
ArthurSens
88bfdb998a
Prepare workspace alerts to centralized alerting
...
Signed-off-by: ArthurSens <arthursens2005@gmail.com>
2022-11-07 23:10:37 +01:00
JenTing Hsiao
7e2f4b166d
obs: fix new workspace not showing workspace success rate
...
Signed-off-by: JenTing Hsiao <hsiaoairplane@gmail.com>
2022-11-04 04:47:08 +01:00
Manuel Alejandro de Brito Fontes
d01040cb19
Fix registry facade blobSource dashboard and add link
2022-11-04 00:07:08 +01:00
Thomas Schubart
f31bbd2ca9
[obs] Rename psi dashboard to node psi
2022-11-01 03:40:06 +01:00
Thomas Schubart
fb62393f1f
[obs] Create workspace psi dashboard
2022-11-01 03:40:06 +01:00
Manuel Alejandro de Brito Fontes
52848f6e18
[registry-facade] Add new blobSource dashboard
2022-10-25 23:35:40 +02:00
JenTing Hsiao
b9c841f2f5
obs: fix new workspace not showing workspace success rate
...
Signed-off-by: JenTing Hsiao <hsiaoairplane@gmail.com>
2022-10-13 22:34:29 +02:00
Thomas Schubart
f668ab404f
[obs] Remove NetworkConnectionsTooHigh alert
2022-10-11 04:08:26 +02:00
Thomas Schubart
f4a71fa6cc
[obs] Dashboard for psi metrics
2022-10-04 00:34:19 +02:00
Thomas Schubart
e99d002c5a
Revert "fix network connections alert to fire only for workspace pods"
...
This reverts commit 83d4edba28efbe99a0c00d1d26e747b3824ee3c7.
2022-09-29 11:34:29 +02:00
Pavel Tumik @ GitPod
83d4edba28
fix network connections alert to fire only for workspace pods
2022-09-29 03:49:29 +02:00
Pavel Tumik @ GitPod
ea8fbdc4dd
fix prometheus rule for workspaces
2022-09-21 22:16:22 +02:00
Thomas Schubart
f61eacf1e8
[obs] Fix dashboard import error
2022-09-20 22:46:21 +02:00
Milan Pavlik
30cffea01a
[image-builder] Move dashboard to team Workspace
2022-09-19 20:24:20 +02:00
Thomas Schubart
bf5917f631
[obs] Add network limiting overview panel
2022-09-19 20:22:20 +02:00
Thomas Schubart
243ee21379
[obs] Fix display of network limiting stats
...
- Ensure data source is selected
- Use network limiting stats for sourcing workspace and node
2022-09-16 01:00:16 +02:00
Thomas Schubart
fc2b4422c6
Import network limiting dashboard
2022-09-09 13:59:24 +02:00
ArthurSens
9b382b6f69
Fix PrometheusRule name
...
Signed-off-by: ArthurSens <arthursens2005@gmail.com>
2022-09-08 20:31:23 +02:00
ArthurSens
7c354c9a38
Replace workspace alerts from jsonnet to YAML
...
Signed-off-by: ArthurSens <arthursens2005@gmail.com>
2022-09-08 15:46:23 +02:00
Thomas Schubart
ccb148f2a6
[observability] Add dashboard for network limiting
2022-08-31 10:27:15 +02:00
Pavel Tumik @ GitPod
cc79d75a96
[alerts] increase GitpodWorkspaceStuckOnStopping for time to 30min to reduce flakiness
2022-08-23 20:32:39 +02:00
JenTing Hsiao
d5462c0d02
observability: add #workspace > 20 in alert GitpodWorkspaceTooManyRegularNotActive
...
This prevents the alert from being triggered once we start traffic shifting.
When the number of workspaces is low,
gitpod_ws_manager_workspace_activity_total is a small number, so
gitpod_workspace_regular_not_active_percentage easily hits the alert
threshold.
Therefore, we add #workspace > 20 as another criterion for the alert.
Signed-off-by: JenTing Hsiao <hsiaoairplane@gmail.com>
2022-08-11 16:35:28 +02:00
utam0k
2d1f66ae25
observability: Add an alert for network connections.
2022-08-10 05:55:54 +02:00
Pavel Tumik
06a686acf1
[alerts] change load avg alert to critical
2022-08-05 16:11:49 -03:00
Manuel Alejandro de Brito Fontes
a5dd648f06
Add dashboard for node problem detector
2022-07-26 16:05:21 -03:00
Manuel Alejandro de Brito Fontes
70eaa01676
Add dashboard for ephemeral storage
2022-07-26 15:24:21 -03:00
Arthur Silva Sens
cd28f4c34d
Route GitpodWorkspaceStuckOnStarting to #t_workspace_alerts
2022-07-26 14:15:21 -03:00
Manuel Alejandro de Brito Fontes
18c764cbac
Add dashboard for swap utilization per cluster and node
2022-07-19 19:40:14 -03:00
Pudong Zheng
25c5bfbecb
[alerts] change alert for adding new nodes rapidly to only count if node type is regular workspace
2022-07-18 03:40:13 +02:00
Nandaja Varma
d13bdfd0cd
Improve GitpodWorkspaceTooManyRegularNotActive alert
2022-07-11 06:09:58 +05:30
utam0k
04d945d216
observability: Add an alert for AutoscaleFailure.
2022-07-06 00:36:52 +05:30
JenTing Hsiao
7800a21c4d
[alerts] fix pod/container/namespace not rendering
...
Every time series is uniquely identified by its metric name and
a set of labels, and every unique combination of key-value label pairs
represents a new alert for this time series.
There is no common value for these metrics:
- kube_pod_container_status_restarts_total
- gitpod_ws_manager_workspace_backups_failure_total
Signed-off-by: JenTing Hsiao <hsiaoairplane@gmail.com>
2022-07-01 06:23:39 +05:30
utam0k
6c2705fbe4
observability: Ring the phone only when data loss occurs with GitpodWsDaemonCrashLoopingg
2022-06-23 19:06:32 +05:30
Pavel Tumik
cf35903aff
Apply suggestions from code review
2022-06-23 02:46:31 +05:30