Founder of Miget.com – a Zero-Ops PaaS with fair scheduling, microVMs, etc. Focus: infra simplicity, container runtimes, and developer-first platforms. Building in Ruby/Python, and Go, with experience in Kubernetes and low-level Linux systems.
The routing layer being down while workers/data services stay up is such a specific failure mode. Usually means the load balancers or edge routing got corrupted somehow, not the actual compute infrastructure.
If you're serious about migrating off (and not just saying it in the heat of the moment), the main thing is having a plan for the database migration. That's always the painful part. Everything else is just Docker containers that run anywhere.
Yeah, if Heroku's cert rotation depends on Google's CA and it tried to renew during the outage window, that'd definitely cause problems. The 8-hour ETA is rough. This is why multi-CA fallback configs exist, but most platforms don't bother until they get burned by something like this. Worth checking if your apps are actually affected or if it's just the dashboard/API having issues.
The fact that your database is reachable but the app isn't after restart points to the routing layer being toast. If you can ssh into the dyno or check logs, look for anything related to the router handshake timing out. Usually when deploys work but traffic doesn't route, it's something between the load balancer and your instances.
The correlation with scheduled restarts is interesting though. Makes me wonder if there's a cert validation issue on boot that's causing new instances to fail health checks.
This project is an enhanced reader for Ycombinator Hacker News: https://news.ycombinator.com/.
The interface also allow to comment, post and interact with the original HN platform. Credentials are stored locally and are never sent to any server, you can check the source code here: https://github.com/GabrielePicco/hacker-news-rich.
For suggestions and features requests you can write me here: gabrielepicco.github.io