Kyle Brennan
696ec03449
[obs] fix alerts for workspace deployments ( #19502 )
...
The proper label to use with `kube_deployment_spec_replicas` is `deployment`.
`container` is okay with `kube_pod_container_status_restarts_total`.
2024-03-05 01:49:14 +02:00
Gero Posmyk-Leinemann
d40119977c
GitpodWorkspaceTooManyRegularNotActiveMk2: add lower bound of 20 ( #19439 )
2024-02-19 11:53:00 +02:00
Kyle Brennan
31db761ae7
[obs] tune GitpodWorkspaceStuckOnStoppingMk2 alert ( #19299 )
...
This alert fired 8 times for us107 since Dec 23; none of the pages required action from the on-call operator.
Let's make it more difficult for the alert to fire, to avoid unnecessary escalations to on-call.
2023-12-31 22:30:38 +02:00
Mads Hartmann
3ff9173705
This adds an alert for GitpodWorkspaceHighStartFailureRate ( #19099 )
2023-11-21 14:07:58 +02:00
Wouter Verlaek
110defe741
Use humanize1024 instead ( #19036 )
2023-11-08 11:19:45 +02:00
Wouter Verlaek
7af72bb539
Add GiB left in storage warning ( #19035 )
2023-11-08 10:45:45 +02:00
Kyle Brennan
5b13b510ec
[obs] remove GitpodImagebuildStartSuccess warning ( #19002 )
...
This expression dips regularly and does not provide value as a notification in its current form.
2023-11-02 22:09:40 +02:00
Kyle Brennan
f3cd71cc2d
[obs] remove noisy GitpodWorkspaceStuckOnStoppedMk2 ( #18998 )
...
Pods stuck in Stopped are removed after the 30m grace period: c346773e50/components/ws-manager-mk2/controllers/create.go (L54)
2023-11-02 00:05:39 +02:00
Kyle Brennan
c346773e50
[ops] change team from Workspace to Engine ( #18997 )
2023-11-01 22:46:39 +02:00
Thomas Schubart
b57d8707e4
Include regular not active alerts in Dedicated ( #18702 )
2023-09-13 13:33:53 +02:00
Aleksandar
bcfa933865
[alerts] group by cluster for the NodePoolLoad alert ( #18663 )
2023-09-06 08:44:03 +02:00
Kyle Brennan
9cc716412d
[obs] demote GitpodImageBuildDurationAnomaly to a warning ( #18507 )
2023-08-14 14:17:40 +02:00
Kyle Brennan
285f11d234
[obs] reduce false positives for GitpodImageBuildDurationAnomaly ( #18496 )
2023-08-11 16:46:37 +02:00
Kyle Brennan
e12de07cf2
[obs] GitpodImageBuildDurationAnomaly appears to fire too often for Dedicated ( #18472 )
...
Remove to avoid alerting on-callers. Circle back after we have a better expression, or a means to define criteria that are exclusive to Dedicated.
2023-08-09 20:57:35 +02:00
Kyle Brennan
d8b68fd515
[obs] delay triggering GitpodWorkspaceTooManyRegularNotActiveMk2 ( #18409 )
...
Related false alarm: https://gitpod.slack.com/archives/C01TNS8EVQT/p1690945869670219
2023-08-02 20:49:27 +08:00
Kyle Brennan
b90e12b7cb
[obs] re-enable regular not active alerts ( #18341 )
...
* [obs] Add back critical regular not active alerts
Related to ENG-15
Now that we have related data, we should resume triggering alerts if the data condition occurs.
* [obs] Fix runbook_url for GitpodImageBuildDurationAnomaly
Was getting 404
* [obs] Fix GitpodWorkspaceTooManyRegularNotActiveMk2 given https://www.gitpodstatus.com/incidents/bsrqgmsxw1gr
* [obs] share why regular not active is excluded from Dedicated
* [obs] consolidate runbook for regular not active alerts
2023-07-31 21:31:26 +08:00
Wouter Verlaek
5e7eff45d8
Warn when IPFS is running out of storage ( #18386 )
...
* Warn when IPFS is running out of storage
* Add critical alert
2023-07-31 19:54:26 +08:00
Kyle Brennan
8266f6175c
Fix gitpod_workspace_regular_not_active_percentage_mk2, and temporarily disable related alerts ( #18332 )
...
We'll likely replace the alerts too, using one that detects anomalies given zscore.
Related to ENG-20
2023-07-24 19:40:40 +08:00
Kyle Brennan
2d3c03ee43
[obs] introduce workspace alerts for Dedicated ( #18331 )
...
* [obs] introduce GitpodImageBuildDurationAnomaly
Depends on https://github.com/gitpod-io/runbooks/pull/417
* [obs] Introduce GitpodImageBuilderCrashlooping
As per https://samber.github.io/awesome-prometheus-alerts/rules#rule-kubernetes-1-19
* [obs] Introduce GitpodImageBuilderReplicasMismatch
* [obs] use generic GitpodWorkspaceDeploymentCrashlooping for GitpodWsManagerCrashLoopingMk2
* Fix GitpodWsManagerCrashLoopingMk2
To avoid false positives
* Introduce GitpodWsManagerMk2ReplicasMismatch
* Fix syntax
* Fix GitpodWorkspaceDeploymentReplicaMismatch URL
* Introduce alerts for node-labeler and ws-proxy
* Fix severity and dedicated labels
* Fix proxy references
* Exclude ephemeral clusters
* Clean-up
2023-07-24 19:39:40 +08:00
Kyle Brennan
89797f48a4
[obs] add node variable and update panels in Node PSI dashboard to use it ( #18250 )
...
Related to WKS-303
2023-07-12 05:42:29 +08:00
Kyle Brennan
f3eb34242b
[obs] further restrict NodePoolLoad to avoid false positives ( #18222 )
...
Only trigger alerts when 4 or more nodes have high load average that is sustained over 1 for 60m.
2023-07-12 02:00:28 +08:00
Wouter Verlaek
ad21ecb48e
Fix: remove ws-manager.json dashboard ( #18230 )
2023-07-11 18:02:27 +08:00
Wouter Verlaek
85a0e9a67c
[ws-manager-mk2] Fix metric labels ( #18220 )
2023-07-11 16:55:28 +08:00
Kyle Brennan
13aa60d211
[obs] reduce noise related to GitpodWsManagerCrashLoopingMk2 ( #18181 )
...
When it triggers now, it's generally due to WKS-210, and is not valuable to gitpod.io or Dedicated in its current form.
In other words, if ws-manager-mk2 restarts, it recovers and no action is needed. If it's unable to start, we won't be able to createWorkspace (and server should emit a signal).
Fixes WKS-288
2023-07-07 10:48:24 +08:00
Wouter Verlaek
76b585e6e8
Add longest running reconcile to dashboards ( #18022 )
2023-06-22 20:38:12 +08:00
Thomas Schubart
ab8244040d
[workspace] Update regular not active alert ( #17997 )
2023-06-21 16:09:12 +08:00
Kyle Brennan
6755b23081
[obs] Fix workspace alerts to work in Grafana cloud ( #17944 )
...
* [obs] tag workspace alerts for Dedicated
* [obs] rename duplicate workspace-monitoring-rules
* [obs] fix duplicate ws-daemon-monitoring-rules
2023-06-15 22:01:06 +08:00
Kyle Brennan
aba2cfdfe4
[obs] remove ws-manager (mk1) alert references ( #17941 )
...
* [obs] remove ws-manager (mk1) alert references
This includes old alerts, old recording rules, and SLO alert rules that appear incomplete, and fixes a bug with BackupFailureBecauseOfGitpodWsDaemonCrash
* [obs] flip GitpodImagebuildDoneSuccess and GitpodImagebuildStartSuccess back to warning due to prior false positives
2023-06-15 22:00:06 +08:00
Kyle Brennan
1422b8ea5a
[obs] fix and tune NodePoolLoad alert ( #17757 )
2023-05-26 19:12:00 +08:00
Kyle Brennan
2208a8792e
[obs] alert when cluster is under high load ( #17755 )
...
* [obs] formatting
* [obs] alert and inspect cluster due to high sustained load
2023-05-26 09:48:59 +08:00
Thomas Schubart
7c41572bc9
[wsman-mk2] Remove mk1 from workspace success ( #17497 )
2023-05-25 17:37:59 +08:00
Thomas Schubart
e8a3c4e3bc
[obs] Include p01, p25 and p75 in success criteria ( #17468 )
...
* [obs] Include p25 and p75 in success criteria
* [obs] Include p01 workspace startup
2023-05-03 17:23:41 +08:00
Kyle Brennan
4cd2d5519f
[obs] change startup time rate to rate interval ( #17453 )
...
A rate of 5m makes the graph too dense, and it doesn't match the overview dashboard's heatmap.
2023-05-02 22:56:40 +08:00
Thomas Schubart
958db8be9a
[obs] Fix casing of names ( #17418 )
2023-04-27 22:23:35 +08:00
Thomas Schubart
bea298ae17
[wsman-mk2] Include in workspace success criteria ( #17375 )
2023-04-27 03:39:35 +08:00
Thomas Schubart
c55d1f911e
[wsman-mk2] Add alerts for ws-manager-mk2 ( #17362 )
2023-04-26 00:07:46 +08:00
Wouter Verlaek
7050e289b4
[ws-manager-mk2] Dashboard controller heatmaps (WKS-21) ( #17093 )
...
* [ws-manager-mk2] Dashboard controller heatmaps
* [ws-daemon] Use heatmaps
2023-04-03 10:28:43 +02:00
Wouter Verlaek
e7b89d60d6
[ws-manager-mk2] Dashboard improvements ( #17120 )
2023-03-31 23:32:41 +02:00
Wouter Verlaek
a9810d6a0a
[ws-manager-mk2] Fix race where pod gets recreated in Stopped phase ( #16622 )
...
* [ws-manager-mk2] Fix race where pod gets recreated in Stopped phase
* [ws-manager-mk2] Add pod creation logs
* Change to Patch
2023-03-02 13:27:59 +01:00
Wouter Verlaek
cf0dd5571f
[ws-manager-mk2] Show start failures in dashboard, show daemon ctrl metrics ( #16612 )
2023-03-01 12:13:58 +01:00
Wouter Verlaek
d827a2b9dd
[ws-manager-mk2] Add queue depth and work duration panels ( #16555 )
2023-02-24 13:47:54 +01:00
Wouter Verlaek
733c37b2f8
[ws-manager-mk2] Import dashboard ( #16532 )
2023-02-23 15:12:53 +01:00
Wouter Verlaek
7440f00796
[ws-manager-mk2] Add Grafana dashboard ( #16455 )
...
* [ws-manager-mk2] Add Grafana dashboard
* [ws-manager-mk2] Add reconciliations by controller panel
2023-02-23 00:19:52 +01:00
Kyle Brennan
598b5372e8
[obs] Refactor alerts for image builds
...
For the last 30 days:
GitpodImagebuildDoneSuccess would have triggered once, on January 26 if set to 2h, instead of 4h. A customer was potentially struggling with the outer loop. We hit a 60% error rate in Pyrra briefly: https://pyrra.gitpod.io/objectives?expr={__name__=%22workspace-imagebuild-buildsdone-success-ratio%22,%20namespace=%22monitoring-central%22,%20team=%22workspace%22}&grouping={}&from=1673297716785&to=1675716916785
GitpodImagebuildStartSuccess would have fired once, on Jan 8, when GCP was having scaling issues, and would have been correct to do so. https://gitpod.slack.com/archives/C01TNS8EVQT/p1673173223060219
Removed the warnings because they're unnecessary. Why? Pyrra sends them now for SLOs to #team-workspace-alerts.
2023-02-16 14:51:21 +01:00
utam0k
33e6d1f540
obs: Make AutoscaleFailure go down to warning level
2023-01-20 06:20:27 +01:00
Wouter Verlaek
e3ce970423
[observability] Add image build rate panels
2023-01-09 17:00:48 +01:00
Kyle Brennan
f08784fbc8
[obs] fix image-builder-mk3 dashboard
2022-12-26 02:24:34 +01:00
Kyle Brennan
c01d43b809
[obs] move blobserve from Workspace to IDE
2022-12-26 02:22:34 +01:00
Pudong Zheng
fc6355a8d2
[observability] fix datasource in preview environment
2022-12-09 06:54:19 -03:00
Christian Weichel
478a75e744
Switch license to AGPL
2022-12-08 13:05:19 -03:00