378 Commits

Author SHA1 Message Date
Kyle Brennan
217ba2d2a4
[ops] add Slack alert for GitpodImageBuilderMk3InternalErrors (#20753)
* [ops] add Slack alert for GitpodImageBuilderMk3InternalErrors

* Fix
2025-04-21 11:38:31 -04:00
Gero Posmyk-Leinemann
3e570aea5d
[obs] Make sure to not alert for 0 backup failures (#20458) 2024-12-16 09:53:02 -05:00
Gero Posmyk-Leinemann
9de83391fc
[observability] Introduce "ReplicaUnavailable" alerts (#20344)
* [observability] ReplicaMismatch: Improve "the mismatch is 1.0" message

* [observability] Introduce "ReplicasUnavailable" alert (as warning for now)
2024-11-07 02:42:03 -05:00
Kyle Brennan
2c2a86e264
[ops] Introduce GitpodWsManagerMk2BackupFailureError and GitpodWsManagerMk2BackupFailureCritical (#20259)
* [ops] Introduce GitpodWsManagerMk2BackupFailureError and GitpodWsManagerMk2BackupFailureCritical

* Fix
2024-10-02 03:07:17 -04:00
Kyle Brennan
ff325d8cb6
[obs] reduce potential for ProxyBadGateway to trigger (#19959)
Now, if we've violated five 9's, then we'll trigger...but not before (which is what has been happening lately).

Ref: https://gitpod.slack.com/archives/C01TNS8EVQT/p1718979601030609?thread_ts=1718961427.787819&cid=C01TNS8EVQT
2024-06-21 17:27:17 +02:00
Kyle Brennan
4c5ff41c86
[obs] Add Caddy metrics to Proxy dashboard & Caddy alerts (#19926)
* [obs] add caddy response rate for proxy

* Update

* More

* Fix X axis legend label

* Add request duration heatmap

* Introduce ProxyBadGateway alert
2024-06-19 14:08:14 +02:00
Pudong
8050a04ab8
[alert] remove WorkspacesNotStartingMk2 and WorkspaceTooManyRegularNotActiveMk2 from dedicated (#19760) 2024-05-22 01:27:02 +08:00
Kyle Brennan
f60af532e9
[obs] fix startup time panels (#19753)
They were under the normalized load average for headless nodes
2024-05-17 20:26:55 +08:00
Kyle Brennan
bcf7797941
Opt WsManagerBridgeHighEventsReceived into Enterprise alerting (#19752) 2024-05-17 00:20:54 +08:00
Kyle Brennan
6f9cb64b8b
[obs] alert on-call when ws-manager-bridge events received is too high (#19749) 2024-05-17 00:04:54 +08:00
Kyle Brennan
a31baf0416
[obs] a couple QOL improvements (#19702)
* [obs] reload Bridge dashboard on time change

* [obs] increase rate to 5m

A default of 1m is too fine grained, and causes us to miss trends.
2024-05-06 19:34:44 +08:00
Huiwen
86727b2e50
[observability] increase dashboard pods readiness check duration (#19568) 2024-03-22 01:12:30 +01:00
Kyle Brennan
309815ba08
[obs] temporarily reduce alert frequency (#19545) 2024-03-15 03:43:24 +02:00
Sven Efftinge
3f015f924c
[server] reduce false positive alerting (#19539) 2024-03-13 16:25:22 +02:00
Thomas Schubart
dff668ae30
Add headless node dashboard (#19524) 2024-03-08 18:04:18 +02:00
Thomas Schubart
3e0955fce4
Alert for high load average on headless nodes (#19522) 2024-03-08 12:20:17 +02:00
Kyle Brennan
696ec03449
[obs] fix alerts for workspace deployments (#19502)
The proper label to use with `kube_deployment_spec_replicas` is deployment.

`container` is okay with `kube_pod_container_status_restarts_total`
2024-03-05 01:49:14 +02:00
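A minimal sketch of the label distinction described in #19502, written as two separate PromQL expressions. The metric names come from kube-state-metrics; the `ws-manager-mk2` selector value, the 15m window, and the thresholds are illustrative assumptions, not the expressions used in the actual alerts.

```promql
# Deployment-level metrics are keyed by the `deployment` label.
# "ws-manager-mk2" is an assumed example value, not taken from the commit.
kube_deployment_spec_replicas{deployment="ws-manager-mk2"}
  != kube_deployment_status_replicas_available{deployment="ws-manager-mk2"}

# Restart counters are reported per container, so `container` is a valid selector here.
# The 15m window and "> 0" threshold are assumptions for illustration.
increase(kube_pod_container_status_restarts_total{container="ws-manager-mk2"}[15m]) > 0
```

Selecting on `container` in the first expression would match no series, because `kube_deployment_spec_replicas` carries no such label, and the alert would never fire.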
Gero Posmyk-Leinemann
bdad590b09
[server] Missing SCM access: Filter out user error on workspace start (#19469)
* [server] Missing SCM access: Filter out user error on workspace start

to prevent false alerts (EXP-1434)

* [proxy] api.: Handle /auth/*/callback
2024-02-27 09:13:08 +02:00
Gero Posmyk-Leinemann
d40119977c
GitpodWorkspaceTooManyRegularNotActiveMk2: add lower bound of 20 (#19439) 2024-02-19 11:53:00 +02:00
Huiwen
d2c0ed2882
[observability] update dashboard not all ready alert (#19317) 2024-01-15 12:01:52 +02:00
Kyle Brennan
31db761ae7
[obs] tune GitpodWorkspaceStuckOnStoppingMk2 alert (#19299)
This alert fired 8 times for us107 since Dec 23, and none of the pages required action from the on-call operator.

Let's make it more difficult for the alert to fire, to avoid unnecessary escalations to on-call.
2023-12-31 22:30:38 +02:00
Kyle Brennan
47eef601a1
[ws-proxy] add SSH Summary to overview dashboard (#19281)
Related to ENG-1340 and ENG-1350
2023-12-18 21:08:25 +02:00
Huiwen
c18e049eb3
[observability] add DashboardStuckInPodInitState alert (#19258)
* [observability] add `DashboardStuckInPodInitState` alert

* Update expr
2023-12-15 10:19:22 +02:00
Anton Kosyakov
9bd532b474
add GitpodV1APIServerErrors alert (#19212) 2023-12-07 16:15:14 +02:00
Mads Hartmann
3ff9173705
This adds an alert for GitpodWorkspaceHighStartFailureRate (#19099) 2023-11-21 14:07:58 +02:00
Wouter Verlaek
110defe741
Use humanize1024 instead (#19036) 2023-11-08 11:19:45 +02:00
Wouter Verlaek
7af72bb539
Add GiB left in storage warning (#19035) 2023-11-08 10:45:45 +02:00
Kyle Brennan
5b13b510ec
[obs] remove GitpodImagebuildStartSuccess warning (#19002)
This expression has dips regularly, and does not provide value as a notification in its current form.
2023-11-02 22:09:40 +02:00
Kyle Brennan
f3cd71cc2d
[obs] remove noisy GitpodWorkspaceStuckOnStoppedMk2 (#18998)
Pods stuck in Stopped are removed after the 30m grace period: c346773e50/components/ws-manager-mk2/controllers/create.go (L54)
2023-11-02 00:05:39 +02:00
Kyle Brennan
c346773e50
[ops] change team from Workspace to Engine (#18997) 2023-11-01 22:46:39 +02:00
Wouter Verlaek
30a4280ab8
Add IPFS storage to overview dashboard (#18949) 2023-10-18 13:37:25 +03:00
Gero Posmyk-Leinemann
83d20c1415
[grafana] SpiceDB: add graph for request consistency (#18904) 2023-10-11 10:50:19 +03:00
Gero Posmyk-Leinemann
a84cca8e4c
[grafana] SpiceDB dashboard: add graphs for cache hit/miss rate and ratio (#18897) 2023-10-09 15:19:17 +03:00
Gero Posmyk-Leinemann
6c7f47dd7b
[alerts] Exclude reason "imageBuildFailedUser" from InstanceStartFailures (#18768) 2023-09-22 08:37:01 +02:00
Thomas Schubart
b57d8707e4
Include regular not active alerts in Dedicated (#18702) 2023-09-13 13:33:53 +02:00
Aleksandar
bcfa933865
[alerts] group by cluster for the NodePoolLoad alert (#18663) 2023-09-06 08:44:03 +02:00
Kyle Brennan
9cc716412d
[obs] demote GitpodImageBuildDurationAnomaly to a warning (#18507) 2023-08-14 14:17:40 +02:00
Kyle Brennan
285f11d234
[obs] reduce false positives for GitpodImageBuildDurationAnomaly (#18496) 2023-08-11 16:46:37 +02:00
Kyle Brennan
e12de07cf2
[obs] GitpodImageBuildDurationAnomaly appears to fire too often for Dedicated (#18472)
Remove to avoid alerting on-callers. Circle back after we have a better expression, or a means to define criteria that are exclusive to Dedicated.
2023-08-09 20:57:35 +02:00
Kyle Brennan
d8b68fd515
[obs] delay triggering GitpodWorkspaceTooManyRegularNotActiveMk2 (#18409)
Related false alarm: https://gitpod.slack.com/archives/C01TNS8EVQT/p1690945869670219
2023-08-02 20:49:27 +08:00
Kyle Brennan
b90e12b7cb
[obs] re-enable regular not active alerts (#18341)
* [obs] Add back critical regular not active alerts

Related to ENG-15

Now that we have related data, we should resume triggering alerts if the data condition occurs.

* [obs] Fix runbook_url for GitpodImageBuildDurationAnomaly

Was getting 404

* [obs] Fix GitpodWorkspaceTooManyRegularNotActiveMk2 given https://www.gitpodstatus.com/incidents/bsrqgmsxw1gr

* [obs] share why regular not active is excluded from Dedicated

* [obs] consolidate runbook for regular not active alerts
2023-07-31 21:31:26 +08:00
Wouter Verlaek
5e7eff45d8
Warn when IPFS is running out of storage (#18386)
* Warn when IPFS is running out of storage

* Add critical alert
2023-07-31 19:54:26 +08:00
Milan Pavlik
3422cc7085
[messagebus] Remove alerts + dashboards (#18337) 2023-07-24 22:19:40 +08:00
Kyle Brennan
8266f6175c
Fix gitpod_workspace_regular_not_active_percentage_mk2, and temporarily disable related alerts (#18332)
We'll likely replace the alerts too, using one that detects anomalies given zscore.

Related to ENG-20
2023-07-24 19:40:40 +08:00
Kyle Brennan
2d3c03ee43
[obs] introduce workspace alerts for Dedicated (#18331)
* [obs] introduce GitpodImageBuildDurationAnomaly

Depends on https://github.com/gitpod-io/runbooks/pull/417

* [obs] Introduce GitpodImageBuilderCrashlooping

As per https://samber.github.io/awesome-prometheus-alerts/rules#rule-kubernetes-1-19

* [obs] Introduce GitpodImageBuilderReplicasMismatch

* [obs] use generic GitpodWorkspaceDeploymentCrashlooping for GitpodWsManagerCrashLoopingMk2

* Fix GitpodWsManagerCrashLoopingMk2

To avoid false positives

* Introduce GitpodWsManagerMk2ReplicasMismatch

* Fix syntax

* Fix GitpodWorkspaceDeploymentReplicaMismatch URL

* Introduce alerts for node-labeler and ws-proxy

* Fix severity and dedicated labels

* Fix proxy references

* Exclude ephemeral clusters

* Clean-up
2023-07-24 19:39:40 +08:00
Brad Harris
9d88f8d5e5
ignore 640 error codes (#18249) 2023-07-13 02:33:29 +08:00
Kyle Brennan
89797f48a4
[obs] add node variable and update panels in Node PSI dashboard to use it (#18250)
Related to WKS-303
2023-07-12 05:42:29 +08:00
Kyle Brennan
f3eb34242b
[obs] further restrict NodePoolLoad to avoid false positives (#18222)
Only trigger alerts when 4 or more nodes have a high load average that is sustained above 1 for 60m.
2023-07-12 02:00:28 +08:00
Milan Pavlik
45a8c259a7
[server] Add Redis subscription/broadcasting to dashboard (#18231) 2023-07-11 19:18:28 +08:00
Wouter Verlaek
ad21ecb48e
Fix: remove ws-manager.json dashboard (#18230) 2023-07-11 18:02:27 +08:00