gitpod

mirror of https://github.com/gitpod-io/gitpod.git synced 2025-12-08 17:36:30 +00:00

Author	SHA1	Message	Date
Kyle Brennan	696ec03449	[obs] fix alerts for workspace deployments (#19502 ) The proper label to use with `kube_deployment_spec_replicas` is deployment. `container` is okay with `kube_pod_container_status_restarts_total`	2024-03-05 01:49:14 +02:00
Gero Posmyk-Leinemann	d40119977c	GitpodWorkspaceTooManyRegularNotActiveMk2: add lower bound of 20 (#19439 )	2024-02-19 11:53:00 +02:00
Kyle Brennan	31db761ae7	[obs] tune GitpodWorkspaceStuckOnStoppingMk2 alert (#19299 ) This alert fired 8 times for us107 since Dec 23, none of the pages required action from the on-call operator. Let's make it more difficult for the alert to fire, to avoid unnecessary escalations to on-call.	2023-12-31 22:30:38 +02:00
Mads Hartmann	3ff9173705	This adds an alert for GitpodWorkspaceHighStartFailureRate (#19099 )	2023-11-21 14:07:58 +02:00
Wouter Verlaek	110defe741	Use humanize1024 instead (#19036 )	2023-11-08 11:19:45 +02:00
Wouter Verlaek	7af72bb539	Add GiB left in storage warning (#19035 )	2023-11-08 10:45:45 +02:00
Kyle Brennan	5b13b510ec	[obs] remove GitpodImagebuildStartSuccess warning (#19002 ) This expression has dips regularly, and does not provide value as a notification in its current form.	2023-11-02 22:09:40 +02:00
Kyle Brennan	f3cd71cc2d	[obs] remove noisy GitpodWorkspaceStuckOnStoppedMk2 (#18998 ) Pods stuck in Stopped are removed after the 30m grace period: `c346773e50/components/ws-manager-mk2/controllers/create.go (L54)`	2023-11-02 00:05:39 +02:00
Kyle Brennan	c346773e50	[ops] change team from Workspace to Engine (#18997 )	2023-11-01 22:46:39 +02:00
Thomas Schubart	b57d8707e4	Include regular not active alerts in Dedicated (#18702 )	2023-09-13 13:33:53 +02:00
Aleksandar	bcfa933865	[alerts] group by cluster for the NodePoolLoad alert (#18663 )	2023-09-06 08:44:03 +02:00
Kyle Brennan	9cc716412d	[obs] demote GitpodImageBuildDurationAnomaly to a warning (#18507 )	2023-08-14 14:17:40 +02:00
Kyle Brennan	285f11d234	[obs] reduce false positives for GitpodImageBuildDurationAnomaly (#18496 )	2023-08-11 16:46:37 +02:00
Kyle Brennan	e12de07cf2	[obs] GitpodImageBuildDurationAnomaly appears to fire too often for Dedicated (#18472 ) Remove to avoid alerting on-callers. Circle back after we have a better expression, or means to define criteria that is exclusive to Dedicated.	2023-08-09 20:57:35 +02:00
Kyle Brennan	d8b68fd515	[obs] delay triggering GitpodWorkspaceTooManyRegularNotActiveMk2 (#18409 ) Related false alarm: https://gitpod.slack.com/archives/C01TNS8EVQT/p1690945869670219	2023-08-02 20:49:27 +08:00
Kyle Brennan	b90e12b7cb	[obs] re-enable regular not active alerts (#18341 ) * [obs] Add back critical regular not active alerts Related to ENG-15 Now that we have related data, we should resume triggering alerts if the data condition occurs. * [obs] Fix runbook_url for GitpodImageBuildDurationAnomaly Was getting 404 * [obs] Fix GitpodWorkspaceTooManyRegularNotActiveMk2 given https://www.gitpodstatus.com/incidents/bsrqgmsxw1gr * [obs] share why regular not active is excluded from Dedicated * [obs] consolidate runbook for regular not active alerts	2023-07-31 21:31:26 +08:00
Wouter Verlaek	5e7eff45d8	Warn when IPFS is running out of storage (#18386 ) * Warn when IPFS is running out of storage * Add critical alert	2023-07-31 19:54:26 +08:00
Kyle Brennan	8266f6175c	Fix gitpod_workspace_regular_not_active_percentage_mk2, and temporarily disable related alerts (#18332 ) We'll likely replace the alerts too, using one that detects anomalies given zscore. Related to ENG-20	2023-07-24 19:40:40 +08:00
Kyle Brennan	2d3c03ee43	[obs] introduce workspace alerts for Dedicated (#18331 ) * [obs] introduce GitpodImageBuildDurationAnomaly Depends on https://github.com/gitpod-io/runbooks/pull/417 * [obs] Introduce GitpodImageBuilderCrashlooping As per https://samber.github.io/awesome-prometheus-alerts/rules#rule-kubernetes-1-19 * [obs] Introduce GitpodImageBuilderReplicasMismatch * [obs] use generic GitpodWorkspaceDeploymentCrashlooping for GitpodWsManagerCrashLoopingMk2 * Fix GitpodWsManagerCrashLoopingMk2 To avoid false positives * Introduce GitpodWsManagerMk2ReplicasMismatch * Fix syntax * Fix GitpodWorkspaceDeploymentReplicaMismatch URL * Introduce alerts for node-labeler and ws-proxy * Fix severity and dedicated labels * Fix proxy references * Exclude ephemeral clusters * Clean-up	2023-07-24 19:39:40 +08:00
Kyle Brennan	89797f48a4	[obs] add node variable and update panels in Node PSI dashboard to use it (#18250 ) Related to WKS-303	2023-07-12 05:42:29 +08:00
Kyle Brennan	f3eb34242b	[obs] further restrict NodePoolLoad to avoid false positives (#18222 ) Only trigger alerts when 4 or more nodes have high load average that is sustained over 1 for 60m.	2023-07-12 02:00:28 +08:00
Wouter Verlaek	ad21ecb48e	Fix: remove ws-manager.json dashboard (#18230 )	2023-07-11 18:02:27 +08:00
Wouter Verlaek	85a0e9a67c	[ws-manager-mk2] Fix metric labels (#18220 )	2023-07-11 16:55:28 +08:00
Kyle Brennan	13aa60d211	[obs] reduce noise related to GitpodWsManagerCrashLoopingMk2 (#18181 ) When it triggers now, it's generally due to WKS-210, and is not valuable to gitpod.io or Dedicated in its current form. In other words, if ws-manager-mk2 restarts, it recovers and no action is needed. If it's unable to start, we won't be able to createWorkspace (and server should emit a signal). Fixes WKS-288	2023-07-07 10:48:24 +08:00
Wouter Verlaek	76b585e6e8	Add longest running reconcile to dashboards (#18022 )	2023-06-22 20:38:12 +08:00
Thomas Schubart	ab8244040d	[workspace] Update regular not active alert (#17997 )	2023-06-21 16:09:12 +08:00
Kyle Brennan	6755b23081	[obs] Fix workspace alerts to work in Grafana cloud (#17944 ) * [obs] tag workspace alerts for Dedicated * [obs] rename duplicate workspace-monitoring-rules * [obs] fix duplicate ws-daemon-monitoring-rules	2023-06-15 22:01:06 +08:00
Kyle Brennan	aba2cfdfe4	[obs] remove ws-manager (mk1) alert references (#17941 ) * [obs] remove ws-manager (mk1) alert references This includes old alerts, old recording rules, what appear to be SLO alert rules that appear incomplete, and fixing a bug with BackupFailureBecauseOfGitpodWsDaemonCrash * [obs] flip GitpodImagebuildDoneSuccess and GitpodImagebuildStartSuccess back to warning due to prior false positives	2023-06-15 22:00:06 +08:00
Kyle Brennan	1422b8ea5a	[obs] fix and tune NodePoolLoad alert (#17757 )	2023-05-26 19:12:00 +08:00
Kyle Brennan	2208a8792e	[obs] alert when cluster is under high load (#17755 ) * [obs] formatting * [obs] alert and inspect cluster due to high sustained load	2023-05-26 09:48:59 +08:00
Thomas Schubart	7c41572bc9	[wsman-mk2] Remove mk1 from workspace success (#17497 )	2023-05-25 17:37:59 +08:00
Thomas Schubart	e8a3c4e3bc	[obs] Include p01, p25 and p75 in success criteria (#17468 ) * [obs] Include p25 and p75 in success criteria * [obs] Include p01 workspace startup	2023-05-03 17:23:41 +08:00
Kyle Brennan	4cd2d5519f	[obs] change startup time rate to `rate interval` (#17453 ) A rate of 5m makes the graph too dense, and it doesn't match the overview dashboard's heatmap.	2023-05-02 22:56:40 +08:00
Thomas Schubart	958db8be9a	[obs] Fix casing of names (#17418 )	2023-04-27 22:23:35 +08:00
Thomas Schubart	bea298ae17	[wsman-mk2] Include in workspace success criteria (#17375 )	2023-04-27 03:39:35 +08:00
Thomas Schubart	c55d1f911e	[wsman-mk2] Add alerts for ws-manager-mk2 (#17362 )	2023-04-26 00:07:46 +08:00
Wouter Verlaek	7050e289b4	[ws-manager-mk2] Dashboard controller heatmaps (WKS-21) (#17093 ) * [ws-manager-mk2] Dashboard controller heatmaps * [ws-daemon] Use heatmaps	2023-04-03 10:28:43 +02:00
Wouter Verlaek	e7b89d60d6	[ws-manager-mk2] Dashboard improvements (#17120 )	2023-03-31 23:32:41 +02:00
Wouter Verlaek	a9810d6a0a	[ws-manager-mk2] Fix race where pod gets recreated in Stopped phase (#16622 ) * [ws-manager-mk2] Fix race where pod gets recreated in Stopped phase * [ws-manager-mk2] Add pod creation logs * Change to Patch	2023-03-02 13:27:59 +01:00
Wouter Verlaek	cf0dd5571f	[ws-manager-mk2] Show start failures in dashboard, show daemon ctrl metrics (#16612 )	2023-03-01 12:13:58 +01:00
Wouter Verlaek	d827a2b9dd	[ws-manager-mk2] Add queue depth and work duration panels (#16555 )	2023-02-24 13:47:54 +01:00
Wouter Verlaek	733c37b2f8	[ws-manager-mk2] Import dashboard (#16532 )	2023-02-23 15:12:53 +01:00
Wouter Verlaek	7440f00796	[ws-manager-mk2] Add Grafana dashboard (#16455 ) * [ws-manager-mk2] Add Grafana dashboard * [ws-manager-mk2] Add reconciliations by controller panel	2023-02-23 00:19:52 +01:00
Kyle Brennan	598b5372e8	[obs] Refactor alerts for image builds For the last 30 days: GitpodImagebuildDoneSuccess would have triggered once, on January 26 if set to 2h, instead of 4h. A customer was potentially struggling with the outer loop. We hit a 60% error rate in Pyrra briefly: https://pyrra.gitpod.io/objectives?expr={__name__=%22workspace-imagebuild-buildsdone-success-ratio%22,%20namespace=%22monitoring-central%22,%20team=%22workspace%22}&grouping={}&from=1673297716785&to=1675716916785 GitpodImagebuildStartSuccess would have fired once, on Jan 8, when GCP was having scaling issues, and would have been correct to do so. https://gitpod.slack.com/archives/C01TNS8EVQT/p1673173223060219 Removed the warnings because they're unnecessary. Why? Pyrra sends them now for SLOs to #team-workspace-alerts.	2023-02-16 14:51:21 +01:00
utam0k	33e6d1f540	obs: Make AutoscaleFailure go down to warning level	2023-01-20 06:20:27 +01:00
Wouter Verlaek	e3ce970423	[observability] Add image build rate panels	2023-01-09 17:00:48 +01:00
Kyle Brennan	f08784fbc8	[obs] fix image-builder-mk3 dashboard	2022-12-26 02:24:34 +01:00
Kyle Brennan	c01d43b809	[obs] move blobserve from Workspace to IDE	2022-12-26 02:22:34 +01:00
Pudong Zheng	fc6355a8d2	[observability] fix datasource in preview environment	2022-12-09 06:54:19 -03:00
Christian Weichel	478a75e744	Switch license to AGPL	2022-12-08 13:05:19 -03:00

138 Commits