GitLab pages issues today again? (and report on 2023-10-30: Gitlab.com is down (#17054) · Issues · GitLab.com / GitLab Infrastructure Team / production · GitLab)
Posted by jpluimers on 2024/03/12
I am still working through open Chrome tabs after having moved, in a period when GitLab had quite a few issues that caused my PagerDuty alerts to go wild.
Today PagerDuty gave me 7 calls in 4 hours again (see [Wayback/Archive] Jeroen Wiert Pluimers @wiert@mastodon.social on X: “@gitlab Since 20240312T1727Z I get PagerDuty alerts from HetrixTools for some pages hosted on GitLab. It would be nice if someone could have a look at gitlab.com/gitlab-com/gl-infra/production/-/issues/17717“).
In addition, I need to check if anything made it to the GitLab issue list from the 20230827 connectivity issues I mentioned at [Wayback/Archive] Jeroen Wiert Pluimers @wiert@mastodon.social on X: “Is it @gitlab hosting having transcontinental issues, or are other continental connections affected as well? These are from two different *.gitlab.io pages as measured via @HetrixTools . No issues are listed at status.gitlab.com“.
Back then, this was the most important one: [Wayback/Archive] GitLab System Status: GitLab.com availability issues – October 30, 2023 15:39 UTC
Likely because of this, wiert.me.gitlab.io had been down for a while as well on 20231030 (see [Wayback/Archive] wiert.me.gitlab.io (Recent History) – HetrixTools, down from 2023-10-30T15:24Z until 2023-10-30T16:14Z for 3 + 3 + 11 + 27 = 44 minutes).
Back then, the hardest part was to quickly find out if there was indeed an issue being investigated at all.
The GitLab status multi-media account on Twitter just points to the status page, which makes it hard to find the underlying issue.
I didn’t archive that one in time: when I got the alerts it didn’t show anything, and when the incident was resolved it was already beyond the cut-off timestamp to mark it as “same day”. The [Wayback/Archive] GitLab System Status graph didn’t show much down-time:
GitLab.com downtime graph from 20231002 through 20231031 (image saved at [Wayback/Archive] 20231031 SVG status graph of https://status.gitlab.com/ missing the downtime at https://gitlab.com/gitlab-com/gl-infra/production/-/issues/17054 and https://gitlab.com/gitlab-com/gl-infra/production/-/issues/17057 · GitHub)
After a bit of searching via [Wayback/Archive] GitLab System Status: 2023.10.31 and [Wayback/Archive] GitLab System Status history (2023-10-30 and earlier), I found [Wayback/Archive] GitLab System Status: GitLab.com availability issues; Components=Website, API, Git Operations; Locations=Google Compute Engine; October 30, 2023 15:39 UTC which directed to these two issues:
- Main issue: [Wayback/Archive] 2023-10-30: Gitlab.com is down (#17054) · Issues · GitLab.com / GitLab Infrastructure Team / production · GitLab
- Post mortem: [Wayback/Archive] Incident Review: 2023-10-30 Gitlab.com is down (#17057) · Issues · GitLab.com / GitLab Infrastructure Team / production · GitLab (which mentioned a total downtime of 48 minutes of which HetrixTools measured 44 minutes for my domain).
Lesson learned: as soon as GitLab status turns non-green, start watching and refreshing [Wayback/Archive] Issues · GitLab.com / GitLab Infrastructure Team / production · GitLab (once it is reachable again) to see more information than the status page or the tweets below offer.
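The refreshing could be scripted instead of done by hand. Below is a minimal Python sketch, assuming the public GitLab REST API (v4) lists the issues of the gitlab-com/gl-infra/production project without authentication; the helper names `issues_url`, `summarize` and `fetch_recent_issues` are made up for illustration, and unauthenticated requests may be rate limited.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the REST API exposes the same issues as the web issue list.
API_ROOT = "https://gitlab.com/api/v4"
PROJECT_PATH = "gitlab-com/gl-infra/production"


def issues_url(project_path: str = PROJECT_PATH, per_page: int = 10) -> str:
    """Build the URL that lists the most recently created open issues."""
    # The project path is used as the project id, so "/" must become "%2F".
    project_id = urllib.parse.quote(project_path, safe="")
    query = urllib.parse.urlencode(
        {"state": "opened", "order_by": "created_at", "sort": "desc", "per_page": per_page}
    )
    return f"{API_ROOT}/projects/{project_id}/issues?{query}"


def summarize(issues: list[dict]) -> list[str]:
    """Turn the decoded JSON issue list into '#iid title url' lines."""
    return [f"#{issue['iid']} {issue['title']} {issue['web_url']}" for issue in issues]


def fetch_recent_issues(url: str, timeout: float = 10.0) -> list[dict]:
    """Fetch and decode the issue list (needs network access)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


# Usage (hits the network, so it is left commented out):
# for line in summarize(fetch_recent_issues(issues_url())):
#     print(line)
```

Running this every few minutes while the status page is non-green would surface a new incident issue such as #17054 as soon as the API answers again.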
The [Wayback/Archive] GitLab.com Status (@gitlabstatus) / Twitter tweet time-line on 20231030 shows the same lack of detail:
- [Wayback/Archive] We are currently investigating availability issues with on GitLab.com. We will update as more information becomes available. status.gitlab.com
- [Wayback/Archive] We are investigating increased database usage as a primary cause. We will update as more information becomes available. status.gitlab.com
- [Wayback/Archive] The incident response team is continuing to investigate causes of the increased database usage, additional teams are being brought in to assist. status.gitlab.com
- [Wayback/Archive] We’ve confirmed potential impacts to Git operations and API as well. Investigation still ongoing. status.gitlab.com
- [Wayback/Archive] We are seeing signs of improvement, and the team has narrowed the potential causes. Mitigation and additional investigation are in progress. status.gitlab.com
- [Wayback/Archive] A likely cause of the database load has been identified. Our incident response team has temporarily disabled part of the project import feature while we continue to investigate. status.gitlab.com
- [Wayback/Archive] We continue to work towards full mitigation. Performance of http://GitLab.com is returning, though project import remains partially disabled while the team investigates. status.gitlab.com
- [Wayback/Archive] Performance continues to improve and the team is continuing to work on mitigations to prevent future impact. status.gitlab.com
- [Wayback/Archive] The incident response team has mitigated the root cause of the performance impacts and is now working to restore full import functionality. Monitoring is ongoing to ensure there are no further impacts. Issue: gitlab.com/gitlab-com/gl-infra/production/-/issues/17054 status.gitlab.com
All of the above tweets but the final one just reference the current status.gitlab.com page, not the actual underlying incident page. Having a link there early on would be far more helpful, as the incident has a lot more information and can be amended with comments that might be crucial to get the issue resolved.
Today’s issue: [Wayback/Archive] SSL certificates issues on some of the GitLab hosted pages (#17717) · Issues · GitLab.com / GitLab Infrastructure Team / production · GitLab
The issue has since been moved to [Wayback/Archive] SSL certificates issues on some GitLab hosted pages with custom domains (#450288) · Issues · GitLab.org / GitLab · GitLab.
For now, the verdict is to make DNS more responsive first.
TLS issues that might be related:
- [Wayback/Archive] 2022-09-14: Pages sites with multiple domains experiencing Let’s Encrypt issues (#7741) · Issues · GitLab.com / GitLab Infrastructure Team / production · GitLab
- [Wayback/Archive] 2023-02-19: airflow.gitlabdata.com certificate expired (#8420) · Issues · GitLab.com / GitLab Infrastructure Team / production · GitLab
- [Wayback/Archive] Issue to use letsencrypt potentially with cert manager on gke to certificates to automatically renew TLS certificate of Airflow (#8427) · Issues · GitLab.com / GitLab Infrastructure Team / production · GitLab
- [Wayback/Archive] renew TLS certificate manually for airflow.gitlabdata.com – expires 2024-02-19 (#24977) · Issues · GitLab.com / GitLab Infrastructure Team / Production Engineering · GitLab
- [Wayback/Archive] Certificate problems with multiple pages to the same repo (#373220) · Issues · GitLab.org / GitLab · GitLab
- [Wayback/Archive] Switch Teleport to use DNS for letsencrypt certificates (#12860) · Issues · GitLab.com / GitLab Infrastructure Team / Production Engineering · GitLab
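Several of the issues above are about certificates that expired or needed manual renewal. As a rough illustration only, here is a Python sketch of how one could check how long a certificate is still valid, using the standard `ssl` module; the function names `days_until_expiry` and `fetch_not_after` are hypothetical, and this is not how HetrixTools or GitLab perform their checks.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` and a certificate 'notAfter' string.

    `not_after` uses the format returned by getpeercert(),
    for example 'Jan  5 09:34:43 2018 GMT'.
    """
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    return (expires - now).total_seconds() / 86400


def fetch_not_after(host: str, port: int = 443, timeout: float = 10.0) -> str:
    """Return the 'notAfter' field of the certificate served by `host`.

    Needs network access. With the default context, an already expired or
    otherwise invalid certificate raises ssl.SSLCertVerificationError
    during the handshake instead of returning a date.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


# Usage (hits the network, so it is left commented out):
# remaining = days_until_expiry(fetch_not_after("wiert.me.gitlab.io"),
#                               datetime.now(timezone.utc))
# print(f"{remaining:.1f} days left")
```

A handshake error from `fetch_not_after` is itself a useful signal: it is the kind of failure a monitor would report for the TLS issues listed above.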
--jeroen





