374 Commits

Author SHA1 Message Date
Manuel Alejandro de Brito Fontes
b9db8b349b Add Summary row to Gitpod overview dashboard 2022-07-20 13:32:15 -03:00
Manuel Alejandro de Brito Fontes
18c764cbac Add dashboard for swap utilization per cluster and node 2022-07-19 19:40:14 -03:00
ArthurSens
735a30899f Update Preview env's dashboard with new metrics
Signed-off-by: ArthurSens <arthursens2005@gmail.com>
2022-07-18 16:39:13 +02:00
Pudong Zheng
25c5bfbecb [alerts] change alert for adding new nodes rapidly to only count if node type is regular workspace 2022-07-18 03:40:13 +02:00
Nandaja Varma
d13bdfd0cd Improve GitpodWorkspaceTooManyRegularNotActive alert 2022-07-11 06:09:58 +05:30
utam0k
04d945d216 observability: Add an alert for AutoscaleFailure. 2022-07-06 00:36:52 +05:30
JenTing Hsiao
7800a21c4d [alerts] fix pod/container/namespace not rendering
Every time series is uniquely identified by its metric name and
a set of labels, and every unique combination of key-value label pairs
represents a new alert for this time series.

There is no common value for these metrics
- kube_pod_container_status_restarts_total
- gitpod_ws_manager_workspace_backups_failure_total

Signed-off-by: JenTing Hsiao <hsiaoairplane@gmail.com>
2022-07-01 06:23:39 +05:30
Arthur Silva Sens
028ef2608b Update overview.json 2022-06-27 13:46:36 +05:30
utam0k
6c2705fbe4 observability: Ring the phone only when a data loss occurs with GitpodWsDaemonCrashLooping 2022-06-23 19:06:32 +05:30
Pavel Tumik
cf35903aff Apply suggestions from code review 2022-06-23 02:46:31 +05:30
Pavel Tumik
7e0fe457fb Apply suggestions from code review 2022-06-23 02:46:31 +05:30
utam0k
62859996d5 observability: Add GitpodWorkspaceTooLongTerminating alert. 2022-06-23 02:46:31 +05:30
ArthurSens
9be43de166 Add SLIs to preview-environment dashboard
Signed-off-by: ArthurSens <arthursens2005@gmail.com>
2022-06-22 12:15:31 +05:30
Mads Hartmann
7b7282e427 Add text panel with link to Preview Start SLO 2022-06-20 17:36:29 +05:30
Anton Kosyakov
bbee5a8df2 [jb] actually fix dashboard 2022-06-17 13:49:26 +05:30
Anton Kosyakov
92ddbcdf24 [jb] fix dashboard 2022-06-17 13:00:26 +05:30
Jan Keromnes
1858e5d61b Remove critical alert GitpodWsDaemonExcessiveGC > 60s (but keep the non-critical warning for now) 2022-06-16 20:56:26 +05:30
Anton Kosyakov
4ff0c7e3b8 [jb]: monitor low memory notifications and gc overhead 2022-06-16 19:41:25 +05:30
Anton Kosyakov
d3dd3be018 Update ssh gw dashboard
Simplify it to show most important info:
- throughput of valid and suspicious traffic
- total failure ratios and error ratios by kind
2022-06-16 17:45:25 +05:30
Arthur Silva Sens
0d02877ace Update preview environment dashboard 2022-06-09 15:39:19 +05:30
Pudong Zheng
931aed87d2 [observability] add SSH gateway overview dashboard 2022-06-08 08:59:17 +05:30
ArthurSens
3a52ba0d10 Add dashboard to monitor preview environments
Signed-off-by: ArthurSens <arthursens2005@gmail.com>
2022-06-06 21:04:17 +05:30
Moritz Eysholdt
65beac91df Fix runbook URL 2022-06-03 22:59:52 +05:30
André Duarte
135a7de765 [observability] Add SLI numbers to the Workspace Success Criteria Dashboard 2022-06-01 17:44:50 +05:30
André Duarte
24031c69ea Improve appearance of Workspace Success Criteria Dashboard 2022-05-30 22:09:48 +05:30
André Duarte
4be3bc9643 Improve Workspace Success Criteria Dashboard
* Ignore non-production clusters (i.e. `prod-meta-.*` and `ephemeral.*`)
* Create a view that represents the overall P50 and P95 workspace startup times
* Group "By Cluster" views in a collapsible row
2022-05-30 22:09:48 +05:30
Gero Posmyk-Leinemann
c025034894 [cleanup] Remove kedge 2022-05-27 14:51:45 +05:30
ArthurSens
49de2b95a0 Add example dashboard for monitoring self-hosted 2022-05-26 21:52:45 +05:30
Pavel Tumik
a6d5ec1055 [observability] add horizontal line for load avg panel 2022-05-24 19:19:39 +05:30
Pavel Tumik
ad8d971176 [alerts] add alert when autoscaler adds nodes rapidly 2022-05-19 12:55:34 +05:30
Prince Rachit Sinha
5045e85f2a [observability] Add alerts for pending phase 2022-05-02 16:39:18 +05:30
Milan Pavlik
0b2aa0d34b [wsm-bridge] Dashboard with health metrics 2022-04-29 18:59:15 +05:30
Prince Rachit Sinha
6f55a53d4e [observability] Add filtering and update link and vars 2022-04-27 17:09:13 +05:30
Prince Rachit Sinha
4515e369fc [observability] Fix incorrect datasource (#9543) 2022-04-26 11:46:49 +02:00
Prince Rachit Sinha
994ddfe03e [observability] Fix Dashboard Error and add Nodepool 2022-04-26 14:07:38 +05:30
Prince Rachit Sinha
de0c0e80a4 [observability] Add GitpodWorkspacesNotStarting alert 2022-04-26 08:53:38 +05:30
Milan Pavlik
75bd9fc4be [ws-man-bridge] Fix Workspace Instance status update dashboard to work across web-app clusters 2022-04-25 17:33:37 +05:30
Prince Rachit Sinha
d7bf168392 [Observability] Add node resource consumption dashboards
Add Grafana dashboard with id 11074 and update it to include cluster-level
filtering and a translation from Chinese to English.
2022-04-22 13:17:34 +05:30
Prince Rachit Sinha
64fbd1e841 [observability] Add alert rule for high ws failure 2022-04-18 22:09:31 +05:30
Prince Rachit Sinha
1d937922ca [observability] Update success criteria formula 2022-04-06 22:22:20 +05:30
Prince Rachit Sinha
568ed40373 [dashboard] Update variable refresh option 2022-04-05 03:12:18 +05:30
Prince Rachit Sinha
d1c610e55d [dashboard] Update success criteria dashboard 2022-04-04 15:24:18 +05:30
Prince Rachit Sinha
d73025c49b [observability] Update agent smith graph 2022-03-24 11:31:07 +05:30
Prince Rachit Sinha
306c7e9179 [observability] Add agent-smith egress violations graph 2022-03-17 18:01:23 +05:30
Gero Posmyk-Leinemann
76ef1af38e [ops] Add alert 'InstanceStartFailures' 2022-03-07 22:30:14 +05:30
Gero Posmyk-Leinemann
8aa11bd566 [ops] WebApp Overview: Add graph for "Instance Starts Success/Failure Rate" 2022-03-07 22:30:14 +05:30
Prince Rachit Sinha
aea24d85f8 Update runbook url for GitpodWsDaemonExcessiveGC 2022-03-01 21:19:08 +05:30
Pavel Tumik
ebb2a33667 change alert period for ws stuck on starting or stopping to 20m 2022-03-01 06:22:08 +05:30
Manuel Alejandro de Brito Fontes
90fe82a508 Remove ghost from the codebase 2022-02-28 14:17:07 +05:30
Prince Rachit Sinha
a48e177120 Add alerts for excessive GC of ws-daemon 2022-02-14 11:25:35 +01:00