298 Commits

Author SHA1 Message Date
Milan Pavlik
a02a5d9db8 [alert] Page on failing workspace starts 2023-02-17 13:23:21 +01:00
Kyle Brennan
598b5372e8 [obs] Refactor alerts for image builds
For the last 30 days:

GitpodImagebuildDoneSuccess would have triggered once, on January 26, if set to 2h instead of 4h. A customer was potentially struggling with the outer loop. We briefly hit a 60% error rate in Pyrra: https://pyrra.gitpod.io/objectives?expr={__name__=%22workspace-imagebuild-buildsdone-success-ratio%22,%20namespace=%22monitoring-central%22,%20team=%22workspace%22}&grouping={}&from=1673297716785&to=1675716916785

GitpodImagebuildStartSuccess would have fired once, on Jan 8, when GCP was having scaling issues, and would have been correct to do so. https://gitpod.slack.com/archives/C01TNS8EVQT/p1673173223060219

Removed the warnings because they're unnecessary. Why? Pyrra sends them now for SLOs to #team-workspace-alerts.
2023-02-16 14:51:21 +01:00
Milan Pavlik
7a8f76f9e5 [ws-man-bridge] Adjust CPU alert to provide better signal 2023-02-16 14:17:20 +01:00
Milan Pavlik
994debf5c0 [dashboard] k8s applications 2023-02-16 08:56:21 +01:00
Kyle Brennan
fc1b4af8e0 [obs] Temporarily avoid imageBuildFailure reason
Why? This alert fires too often and is generally a false positive. In other words, in its current form, it is not a signal of a system failure.
2023-02-07 07:52:45 +01:00
Milan Pavlik
4628ccb5e6 [grafana] Cleanup server component dashboard 2023-01-27 16:27:34 +01:00
Milan Pavlik
961a3c33ed [alerts] Exclude all of 2xx, 3xx, 4xx from JSON RPC API Error Rates 2023-01-25 16:37:32 +01:00
Milan Pavlik
8b88c8f99d [dashboards] Fix double comma 2023-01-25 16:15:33 +01:00
Milan Pavlik
324b8d4950 [dashboard] Migrate server dashboard to timeseries visualization 2023-01-25 14:31:33 +01:00
Milan Pavlik
63817fdff0 [alerts] Reduce trigger duration for Stripe Webhook Failure alert 2023-01-23 11:45:30 +01:00
utam0k
33e6d1f540 obs: Make AutoscaleFailure go down to warning level 2023-01-20 06:20:27 +01:00
Milan Pavlik
51c4adf124 [obs] Adjust CPU alert thresholds for webapp services 2023-01-18 15:07:26 +01:00
Milan Pavlik
dec43f11fe [obs] Fix webapp monitoring rule names 2023-01-18 14:25:25 +01:00
Milan Pavlik
0ceaa6532f [webapp] Group CPU alerts by deployment 2023-01-17 10:06:25 +01:00
Wouter Verlaek
b32eb221e7 Switch image builds axis on overview dashboard 2023-01-12 19:34:52 +01:00
Wouter Verlaek
e3ce970423 [observability] Add image build rate panels 2023-01-09 17:00:48 +01:00
Kyle Brennan
f08784fbc8 [obs] fix image-builder-mk3 dashboard 2022-12-26 02:24:34 +01:00
Kyle Brennan
c01d43b809 [obs] move blobserve from Workspace to IDE 2022-12-26 02:22:34 +01:00
ArthurSens
5d96084625 Delete unused PrometheusRules
Signed-off-by: ArthurSens <arthursens2005@gmail.com>
2022-12-14 04:38:23 -03:00
mustard
0576091fe1 [observability] add job variable for grpc client 2022-12-14 03:53:23 -03:00
Andrea Falzetti
729e0d8aa7 [ide-service]: update grafana dashboard
Co-authored-by: Victor Nogueira <victor@gitpod.io>
2022-12-09 06:56:18 -03:00
Pudong Zheng
fc6355a8d2 [observability] fix datasource in preview environment 2022-12-09 06:54:19 -03:00
Christian Weichel
478a75e744 Switch license to AGPL 2022-12-08 13:05:19 -03:00
Pudong Zheng
422c7cb690 [observability] fix ide-service dashboard 2022-12-08 05:37:18 -03:00
Pavel Tumik @ GitPod
11b9774e3a [alerts] improve autoscale alert to provide actual reason for failure in alert message 2022-12-07 22:49:17 -03:00
Milan Pavlik
227beab32b [alerts] Usage - increase duration of expression to 30m 2022-12-07 05:01:17 -03:00
ArthurSens
4f8927deea Increase scrape interval to decrease datapoints per minute
Signed-off-by: ArthurSens <arthursens2005@gmail.com>
2022-12-06 12:19:16 -03:00
Victor Nogueira
d04294d791 Add IDE Service Dashboard
Co-authored-by: Anton Kosyakov <anton@gitpod.io>
2022-12-06 10:04:16 -03:00
Kyle Brennan
e845faae3c Update operations/observability/mixins/workspace/rules/central/nodes.yaml
Co-authored-by: Wouter Verlaek <wouter@gitpod.io>
2022-12-02 05:56:01 -03:00
Kyle Brennan
171ec14d53 Add alerts for image build success rate 2022-12-02 05:56:01 -03:00
Kyle Brennan
fc3586b5e2 Change GitpodWorkspaceStuckOnStarting to GitpodWorkspaceStuckOnStopped 2022-12-02 05:56:01 -03:00
Kyle Brennan
603446291f Reduce noise with GitpodWorkspaceNodeHighNormalizedLoadAverage now that we have network limiting and PSI so it's more actionable 2022-12-02 05:56:01 -03:00
ArthurSens
d2eea10fbc Drop unused ArgoCD Metrics
Signed-off-by: ArthurSens <arthursens2005@gmail.com>
2022-12-01 14:44:00 -03:00
Pudong Zheng
71118933d8 [observability] merge all ide kind for ide startup time 2022-11-30 06:54:59 -03:00
Pudong Zheng
82eaa40d3a [observability] add IDE startup time dashboard 2022-11-28 09:00:57 -03:00
Pudong Zheng
580e20fd20 [observability] add ide startup time metrics 2022-11-23 11:51:53 -03:00
Thomas Schubart
6469258f28 Update StuckOnStopping alert 2022-11-21 14:19:51 -03:00
mustard
47865a0c76 [observability] add alert for upstream down 2022-11-15 19:14:45 +02:00
ArthurSens
793877aa5f Create resources to monitor Pyrra
Signed-off-by: ArthurSens <arthursens2005@gmail.com>
2022-11-15 17:35:45 +02:00
mustard
6fba3543bb [observability] add vscode install extension failure alert rule 2022-11-11 20:03:41 +02:00
Arthur Silva Sens
77c6026f65 Fix conflicting PrometheusRules 2022-11-10 16:37:40 +02:00
ArthurSens
c0c6b3a150 Fix syntax errors
Signed-off-by: ArthurSens <arthursens2005@gmail.com>
2022-11-09 17:58:39 +02:00
mustard
ebe2dad066 [observability] add threads and latency charts 2022-11-09 11:10:39 +01:00
Kyle Brennan
6ff821261d Reduce noise for GitpodWorkspaceNodeHighNormalizedLoadAverage 2022-11-07 23:10:37 +01:00
ArthurSens
ebed98a31c Split workspace alerts into central and satellite
Signed-off-by: ArthurSens <arthursens2005@gmail.com>
2022-11-07 23:10:37 +01:00
ArthurSens
88bfdb998a Prepare workspace alerts to centralized alerting
Signed-off-by: ArthurSens <arthursens2005@gmail.com>
2022-11-07 23:10:37 +01:00
Anton Kosyakov
e2f03743b5 [openvsx-mirror] add incoming requests monitoring 2022-11-07 11:53:36 +01:00
Jean Pierre
9e6f2f64b6 Fix galleryHost label 2022-11-04 16:39:09 +01:00
Pudong Zheng
b835a407e3 [observability] Adjusting openvsx-mirror and code-browser dashboard 2022-11-04 14:42:08 +01:00
Pudong Zheng
1e2ec46e64 [observability] Adjusting openvsx-mirror dashboard 2022-11-04 12:18:09 +01:00