Breaking GitLab - Chaos Engineering Practice

GitLab 502 Error

The Concept

I self-host Gitea and, like a lot of people using Git, have a GitHub account, but I got really interested in GitLab, so I decided to give it a bash. I quickly decided GitLab was my preference: using pipelines for even the most basic things was a breeze (I love a little automation, even if it's just updating my website with a runner). It's not hugely different from other repo platforms, but different enough that I wanted to really dig into the barebones of it. I spun up a container, and since I believe the best way to learn how something works is often to try and break it, I came up with my "chaos script" idea. The script creates realistic failure scenarios to practice GitLab investigation and recovery: it randomly breaks components without revealing the root cause. Writing the script myself made the challenges too easy, though, since I already knew what was broken, so I got Claude to spin one up. It's not perfect, but the model did a pretty good job, and solving the challenges became genuinely harder.

Lab Environment

  • Proxmox LXC Container - Debian 12, 4 CPU cores, 8GB RAM
  • GitLab CE with custom configuration for resource constraints
  • Dell OptiPlex 5040 host (my little 'Dullbox' that serves as my homelab) - i7-6700, 32GB RAM total

Scenarios Explored

Load Stress Testing

After a lot of trial and error finding the range between "no visible effect" and "crashed my container", I settled on a request rate I thought might also reflect a small-to-medium dev team: 39 concurrent HTTP requests to simulate real user load. Response times climbed from 0.12s to 2.3s. That doesn't sound huge, but it's roughly a 20x slowdown, and it became clear pretty quickly that the database was the primary bottleneck under load. Not a huge surprise in hindsight, but good to confirm. I used custom bash scripts alongside jq to work through the JSON logs.
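The jq side of that is simple enough to sketch. The field names below follow the shape of GitLab's production_json.log (duration_s in seconds is an assumption that holds for recent versions; older releases logged a millisecond duration field), and the sample lines are made up:

```shell
# Made-up lines in the shape of /var/log/gitlab/gitlab-rails/production_json.log
cat > /tmp/sample_production_json.log <<'EOF'
{"method":"GET","path":"/api/v4/projects","status":200,"duration_s":0.12}
{"method":"GET","path":"/api/v4/projects","status":200,"duration_s":2.31}
{"method":"POST","path":"/api/v4/projects/1/pipeline","status":502,"duration_s":5.02}
EOF

# Surface anything slower than a second -- pointed at the real log, this is
# how the slow, database-heavy endpoints stood out from the noise.
jq -r 'select(.duration_s > 1) | "\(.status) \(.duration_s)s \(.path)"' \
  /tmp/sample_production_json.log
```

Swap the sample file for the real log path and adjust the threshold to taste; the same filter also works on a live tail.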

Database and Filesystem Corruption

This one got interesting. The script used the Rails console to corrupt some metadata directly, and simulated filesystem damage by moving repository files around. What I didn't expect was how long GitLab's caching layer masked it. The corruption was there, but the cache was still serving stale data, so everything looked fine from the outside for a while. That sent me down a rabbit hole learning a lot more about GitLab's caching approach than I'd planned. Once I knew to look, gitlab-rake gitlab:git:fsck was the most reliable tool for actually surfacing the damage.
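Under the hood, gitlab-rake gitlab:git:fsck is essentially git fsck swept across every managed repository, and the failure mode is cheap to reproduce on a throwaway repo. This sketch mimics the chaos script's trick of moving repository files out from underneath Git:

```shell
# Build a throwaway repo with one real commit (blob + tree + commit objects).
tmp=$(mktemp -d)
git -C "$tmp" init -q
echo "hello" > "$tmp/README"
git -C "$tmp" add README
git -C "$tmp" -c user.email=lab@example.com -c user.name=lab \
  commit -q -m "baseline"

# Simulate the chaos script's filesystem damage: displace a loose object.
obj=$(find "$tmp/.git/objects" -type f | head -n 1)
mv "$obj" "$tmp/displaced-object"

# fsck now fails on the missing object -- on a real instance,
# gitlab-rake gitlab:git:fsck does this same sweep for every project.
if ! git -C "$tmp" fsck; then echo "corruption detected"; fi
```

The point being: the damage is immediately visible at the Git layer even while GitLab's cache keeps the UI looking healthy.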

CI/CD Token Corruption

This run, the chaos script corrupted the runner config to cause authentication failures. The result was runners that appeared healthy in the UI but couldn't claim jobs, so pipelines would sit in "pending" indefinitely with no obvious explanation. Sidekiq was reporting fine. The tell was checking the job requests directly and finding 403 Forbidden responses. "Silent" failures like this in any environment can be the most frustrating to diagnose, though often also the most satisfying to identify.
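The check itself is just a direct POST to the runner job-request endpoint. The URL and config path below are from my lab setup (assumptions, substitute your own), and the awk line is a crude way to pull the first token value out of config.toml:

```shell
# Ask the job-request endpoint directly and read the HTTP status -- this is
# the check that exposed the 403s behind the "healthy" runners.
GITLAB_URL="${GITLAB_URL:-http://gitlab.lab.local}"   # assumed lab hostname
TOKEN=$(awk -F'"' '/token/ {print $2; exit}' /etc/gitlab-runner/config.toml 2>/dev/null)

status=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 \
  --data "token=${TOKEN}" "${GITLAB_URL}/api/v4/jobs/request") || true

case "$status" in
  204) echo "token accepted, no jobs queued" ;;
  201) echo "token accepted, job handed out" ;;
  403) echo "token rejected: the silent failure -- UI green, pipelines pending" ;;
  *)   echo "no usable response (status ${status:-none}); check the URL first" ;;
esac
```

A 204 here is the healthy "polled, nothing to do" response, which is why the UI stays green right up until a job actually needs claiming.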

Chaos Script Results

A few patterns worth noting came out of the chaos script runs. Invalid configuration syntax often goes completely undetected until gitlab-ctl reconfigure runs, which I'd naively assumed would have some upfront validation. Service state reporting can also be misleading: the database showing as "online" when the underlying process was already dead was a good example. One other interesting find: external URL corruption breaks the web UI redirect while leaving git operations entirely intact, which makes it especially confusing to diagnose if you don't know to check for it.
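The "online but dead" pattern is cheap to check for by cross-referencing what GitLab reports against the process table. The service names are the Omnibus defaults, and the gitlab.rb check leans on the file being plain Ruby, so it only catches broken syntax, not misspelled settings:

```shell
# What gitlab-ctl claims ("run:" / "down:") vs whether a process exists.
report=$(
  for svc in puma sidekiq postgresql redis gitaly; do
    reported=$(gitlab-ctl status "$svc" 2>/dev/null | awk '{print $1}')
    if pgrep -f "$svc" >/dev/null 2>&1; then actual=running; else actual=dead; fi
    printf '%-12s reported=%-9s actual=%s\n' "$svc" "${reported:-unknown}" "$actual"
  done
)
echo "$report"

# gitlab.rb is plain Ruby, so syntax errors at least can be caught
# before gitlab-ctl reconfigure trips over them.
if [ -f /etc/gitlab/gitlab.rb ]; then ruby -c /etc/gitlab/gitlab.rb; fi
```

Any row where reported and actual disagree is exactly the misleading state the chaos script kept manufacturing.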

Technical Skills Developed

Going through all of this improved my confidence with jq for digging through GitLab's JSON logs, and the Rails console for investigating database state directly. More than any specific tool though, it pushed me towards building a more methodical approach to diagnosing problems across multiple interdependent services - something I'd been meaning to work on anyway.

Key Insights

The caching layer was probably the biggest "aha" moment: multiple data layers can hide corruption long enough to send you completely down the wrong path. Using a few different diagnostic tools together rather than trusting any single one made a big difference. And GitLab under load is genuinely impressive; I went in expecting the self-hosted instance to struggle under ~20 concurrent connections, and it held up much better than I expected.

What I Learned

The most valuable lesson was learning to distinguish between symptoms and root causes. A 502 error might be a web server issue, but it could also be database problems, filesystem corruption, or network issues. The chaos script forced me to develop a methodical troubleshooting approach rather than jumping to conclusions.

Another key insight: GitLab's architecture is incredibly complex, with multiple services that can fail independently. Understanding the dependencies between Puma, Sidekiq, PostgreSQL, Redis, and Gitaly is essential for effective troubleshooting, and for not running full pelt down the wrong investigation path.
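That dependency chain suggests a triage order, which I've sketched below as a guarded checklist. Each step is skipped or flagged rather than aborting if a tool is missing or fails, so it degrades gracefully on a box without GitLab; the commands and log path are the Omnibus defaults, and on a real install it wants to run as root:

```shell
triage_502() {
  # Run each check only if the tool exists; flag failures instead of aborting.
  try() {
    if command -v "$1" >/dev/null 2>&1; then "$@" || echo "FAILED: $*"
    else echo "skip: $1 not installed"; fi
  }

  try gitlab-ctl status               # 1. service overview (can lie -- verify)
  try gitlab-psql -c 'SELECT 1;'      # 2. is PostgreSQL actually answering?
  try gitlab-rake gitlab:git:fsck     # 3. is repository data intact?

  # 4. finally, what are real requests hitting? (default Omnibus log path)
  log=/var/log/gitlab/gitlab-rails/production_json.log
  if [ -f "$log" ]; then jq -r '.status' "$log" | sort | uniq -c
  else echo "skip: $log not found"; fi
}
triage_502
```

Working top to bottom like this is what keeps a 502 from being misread as "a web server issue" when the real fault is two layers down.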

Next Steps

  • Expand chaos scenarios to include network partitions and tougher resource exhaustion
  • Document recovery playbooks for common failures
  • Create a monitoring dashboard for the container, integrated into HomeTown to practice proactive detection
  • Apply these chaos engineering principles to other services in my home lab

Resources

View full project details and the 'chaos' script on GitLab →

Last updated: