The worst part of all this is that GitHub's CTO and VP of Engineering sent out the usual "here's what we'll do to fix things" letter to their larger customers and, without exaggeration, it boiled down to: 1) "Here's a bunch of stuff we already did!" which... clearly isn't working, and 2) "We're continuing our Azure migration." also clearly not working.
So needless to say, if you depend on GitHub for critical business operations, you need to start thinking about what a world without GitHub looks like for your business and start working your way toward that. I know my confidence in GitHub's engineering leadership is at rock bottom.
I could sorta see a situation where the reality is "we're in the middle of a miserable transition and it'll clean up when we're done" but I don't think anyone has confidence that's all it is at this point.
Even that doesn’t really make sense to me, unless they’ve done it in a way where everything has to move at once.
Everywhere I’ve worked, if a migration is causing this much downtime then you kill the migration or slow it down. If every change has a 10% chance of bringing the site down, you only do a change every week or two until you can work out the kinks.
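To put rough numbers on that 10% figure, here is a back-of-the-envelope sketch. It assumes every change is independent and carries the same outage risk, which real deploys won't honor exactly, and the change counts per four-week window are made up for illustration.

```python
def p_at_least_one_outage(p_per_change: float, n_changes: int) -> float:
    """Chance that at least one of n independent changes causes an outage."""
    return 1.0 - (1.0 - p_per_change) ** n_changes


def expected_outages(p_per_change: float, n_changes: int) -> float:
    """Expected number of outages across n independent changes."""
    return p_per_change * n_changes


# Hypothetical change cadences over a four-week window.
for label, n in [("one change every two weeks", 2),
                 ("one change per week", 4),
                 ("one change per working day", 20)]:
    p = p_at_least_one_outage(0.10, n)
    print(f"{label}: P(at least one outage) = {p:.0%}, "
          f"expected outages = {expected_outages(0.10, n):.1f}")
```

Under those assumptions, slowing the cadence is the only lever that moves the number until the per-change risk itself comes down.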
Here are some relevant excerpts from an October 2025 article[1]:
> In a message to GitHub’s staff, CTO Vladimir Fedorov notes that GitHub is constrained on capacity in its Virginia data center. “It’s existential for us to keep up with the demands of AI and Copilot, which are changing how people use GitHub,” he writes.
> The plan, he writes, is for GitHub to completely move out of its own data centers in 24 months. “This means we have 18 months to execute (with a 6 month buffer),” Fedorov’s memo says. He acknowledges that since any migration of this scope will have to run in parallel on both the new and old infrastructure for at least six months, the team realistically needs to get this work done in the next 12 months.
If you consider that six month parallel window to have started from the time of the October memo (written presumably at the start of October), then that puts us currently or past the point where they would have cut off their old DC and defaulted to Azure only.
Whether plans or timelines changed, I have no idea of course but the above does make for a convenient timeline that would explain the recent instability. Of course, it could also just be symptomatic of increased AI usage generally and the same problems might have surfaced at a software level regardless of whether they were in a DC or on Azure.
Putting that nuance aside, personally I like the idea that Azure is simply a giant pile of shit operated by a corporation with no taste.
>It’s existential for us to keep up with the demands of AI and Copilot
if by chance the CTO reads this, as a user of GitHub I would find it really existential if GitHub continues functioning as a reliable hub for git workflows (hence the name), and I have the strong suspicion nobody except for the shareholders gives a lick about copilot or 'AI' if it makes the core service the site was designed for unusable
Why? What is the correlation between profit and shareholder sentiment (besides the fact that shareholders want said profits)? They don't really influence the operation of the business meaningfully.
Incorrect. They need to appease/trick/threaten/etc those that are paying for their services. Shareholders just demand they do so at the greatest (often short term) rate.
I heard that they asked LinkedIn to do this too, and LinkedIn either refused outright or refused because their systems were too complex. Maybe that explains why LI availability seems OK.
It's starting to really look like the AI effect. It might be coincidence but I've noticed a lot more downtime and bad software lately. The last Nvidia drivers gave me a blue screen (last week or so), and speaking about Windows, I froze updates last year because it was clear they were introducing a bunch of issues with every update (not to mention unwanted features).
I like AI but actually not for coding because code quality is correlated to how well you understand the underlying systems you're building on, and AI is not really reasoning on this level at all. It's clearly synthesizing training data and it's useful in limited ways.
Interesting how many people "Like AI" because it's good at all the jobs other than the one they happen to make a living doing.
Did you hear about the screenwriters school in which the professors said to avoid AI for writing, but it's great for storyboards. And the storyboard school where the professors said the opposite?
The reality is that AI isn't actually "good" at anything. It produces passable ersatz facsimiles of work that can fool those not skilled in the art. The second reality of AI is that everyone is busy cramming it into their products at the expense of what their products are actually useful for.
Once people realise (1), and stop doing (2), the tech industry has a chance of recovering.
Yeah, I think I heard about that. Within certain domains it is certainly a useful tool. I would say things like online search are much nicer now (in that asking an AI is equivalent to searching online but it summarizes it for you). Online search fits the strengths of LLMs nicely, but right now it's being sold as a silver bullet, which it's not.
GitHub has been unreliable since before AI. Though it's definitely gotten far worse.
Seemingly the decline started with the Microsoft acquisition in 2018, and subsequent "unlimited private repository" change in 2019 (to match Gitlab's popular offer)
One example is the search being broken for CI logs. It takes over your browser's search hotkey too. Every stage of the log starts out collapsed, so the search doesn't work until you trigger the expansion, but if you attempt to search before expanding, the search will never work once it's been initialized. It's pretty infuriating when you're trying to find something in a giant build log.
I’m still baffled that Minecraft is doing so well, despite the whole Bedrock thing. At this point I think Microsoft just forgot that they bought Mojang.
It's had its fair share of outages and outrageous changes that overreach the bounds as well. It's more stable than GitHub is, but it's had at least 2 sessions of downtime this year that I recall, and they were both quite long (day length).
They'd lose a whole lot of users if they killed Java edition, since the modded community is so large. They'd quickly find one of the Minecraft clones reaching feature parity. And there's no good reason for it - it's not like Java is a threat anymore.
Exactly. So why isn't Microsoft doing just that? Isn't that how Microsoft usually handles things? Just look at Xbox. They essentially screwed up everything they could and then some.
I don't remember that happening so much (if ever) in, say, 2016. But the frequency of noticeable incidents seemingly has been rising steadily since around 2023. The Azure migration apparently only exacerbated it.
I remember seeing unicorn daily and "webhook delivery delayed" weekly. I think it got better, but also they got more traffic, now millions of agents read files separately over and over again.
I remember it going down semi-regularly in the 2013+ era, and seeing HN posts about it. Especially if you were using a package manager reliant on GitHub like Cocoapods. It seems to me it is more "impactful" on the dev community now that they have gone past just being a centralized Git server for the team, to being the thing that does deploys and all sorts of other things.
And a ton of the top-end Ruby staff have left. Many of them ended up at Shopify. There is a growing amount of non-Ruby/Rails code at GitHub, but most of the systems that people think of when they think GitHub are Ruby/Rails.
I wonder what the average career tenure of the userbase here is now, because Github was slow and flaky well before Microsoft got involved.
Maybe it wasn't as noticeable when Github had less features, but our CI runners and other automation using the API a decade ago always had weekly issues caused by Github being down/degraded.
It hosts all the repositories backing applycreatures; we ran dozens of git projects on the same instance and have teams. You guys did phenomenal work. I would say it's even easy to customise.
Man, a while ago I thought: "It happens often, alright, but every 2 weeks? Sounds like a slight exaggeration." But it really is every 2 weeks, isn't it? If I imagine in a previous job anything production being down every 2 weeks ... phew, would have had to have a few hard talks and course corrections.
I once fixed a site going down several times a year with two t1.micro instances in the same region as the majority of traffic. Instantly solved the problem for what, $20/month?
Another site was constantly getting DDoSed by Russians who were mad we took down their scams on forums. That had to go through Verisign back then; not sure who they're using now. They may have enough aggregate pipe that it doesn't matter at this point.
There are so many failures in microservices that just can't happen with a local binary. Inter-service communication over network is a big one with a failure rate orders of magnitude higher than running a binary on the same machine. Then you have to do deploys, monitoring, etc. across the whole platform.
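A rough way to see why the network hops add up. This is a sketch that assumes every hop fails independently and that all hops have the same success rate; the 99.9% per-hop figure and the hop counts are illustrative, not measured.

```python
def request_success_rate(per_hop_success: float, hops: int) -> float:
    """Success rate of a request that must get through `hops` serial calls."""
    return per_hop_success ** hops


# Illustrative numbers only: each hop succeeds 99.9% of the time.
for hops in (1, 5, 30):
    rate = request_success_rate(0.999, hops)
    print(f"{hops:2d} hops -> {rate:.2%} of requests succeed")
```

An in-process function call doesn't pay this tax, which is the point being made above.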
You will basically need to employ solutions for problems only caused by your microservices arch. E.g. take reading the logs for a single request. In a monolith, just read the logs. For the many-service approach, you need to work out how you're going to correlate that request across them all.
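For the log correlation part, a minimal sketch of the usual approach: every service adopts the caller's request ID (or mints one at the edge), stamps it on every log line, and forwards it downstream. The `X-Request-ID` header name and the helper names here are assumptions for illustration, not any particular framework's API.

```python
import contextvars
import logging
import uuid

HEADER = "X-Request-ID"  # assumed header name; a common convention, not a standard

# Holds the ID of the request currently being handled.
request_id_var = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Attach the current request ID to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_id_var.get()
        return True


def build_logger(name: str) -> logging.Logger:
    """Create a logger whose lines all carry request_id=<id>."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(name)s request_id=%(request_id)s %(message)s"))
        handler.addFilter(RequestIdFilter())
        logger.addHandler(handler)
    return logger


def handle_incoming(headers: dict) -> dict:
    """Adopt the caller's ID or mint one; return headers for downstream calls."""
    rid = headers.get(HEADER) or uuid.uuid4().hex
    request_id_var.set(rid)
    return {HEADER: rid}


log = build_logger("billing")  # hypothetical service name
outgoing = handle_incoming({HEADER: "abc123"})
log.info("charging card")  # logs: billing request_id=abc123 charging card
```

You then still need somewhere to aggregate and search those lines across services, which is exactly the extra machinery a monolith doesn't need.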
Even the aforementioned network failures require a lot of design, and there's no standardization. Does the calling service retry? Does the callee have a durable queue and pick back up? What happens if a call/message gets 'too old'?
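To make those questions concrete, here is one possible set of answers as a sketch: the caller retries with exponential backoff, gives up after a fixed number of attempts, and drops messages past a maximum age. Every name and default below is hypothetical, and it assumes the callee is idempotent so that a retry can't double-apply.

```python
import time
from dataclasses import dataclass, field


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


@dataclass
class Message:
    payload: str
    created_at: float = field(default_factory=time.monotonic)


def call_with_retry(send, msg, *, max_attempts=3, base_delay=0.1,
                    max_age=5.0, sleep=time.sleep, now=time.monotonic):
    """Try send(msg) with exponential backoff; refuse messages that are too old."""
    for attempt in range(1, max_attempts + 1):
        if now() - msg.created_at > max_age:
            raise TimeoutError("message too old; dropping instead of retrying")
        try:
            return send(msg)
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            sleep(base_delay * 2 ** (attempt - 1))  # 0.1s, 0.2s, 0.4s, ...
```

Each of those defaults is a design decision somebody has to make, document, and keep consistent across every pair of services.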
Also, from the other end, command line utils are typically made by entirely different people with entirely different philosophies/paradigms, so the encapsulation makes sense. That's not true when you're the one writing all the services, especially not at small-to-mid-size companies.
Plus, you already can do the single-concern thing in a monolith, just with modules/interfaces/etc.
One strategy to convince them is to get someone less technical than you to sit by you while you try to trace everything from one error'd HTTP request from start to finish to diagnose the problem. If they see it takes half a day to check every call to every internal endpoint to 100% satisfy a particular request, sometimes that can help.
Also sometimes they just think "this is a bunch of nerd stuff, why are you involving me?!" So it's not foolproof.
Oh, my non-technical boss agrees with me already. It's actually the engineers who've convinced themselves it's a good setup. Nice guys but very unwilling to change. Seems they're quite happy to have become 'experts' in this mess over the last 5-10 years. Almost like they're in retirement mode.
The real solution is probably to leave, but the market sucks at the moment. At least AI makes the 10-repos-per-tiny-feature thing easier.
Does anyone else ever think "that code I just pushed into my repo just took down all of github..." whenever it goes down around the same time you sync your changes?
Just moved a project of mine to Gitlab. Created this very simple component with codex that will keep a mirror updated on GitHub for me, so I can focus development on Gitlab.
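For anyone wanting to try the same thing, a minimal sketch of what such a mirror job could boil down to. This is not the component mentioned above, just an assumption about the general shape: keep a bare mirror clone of the primary repo and `git push --mirror` it to the secondary. The URLs and paths are placeholders, and authentication is left out.

```python
import subprocess
from pathlib import Path
from typing import List


def mirror_commands(source_url: str, mirror_url: str, workdir: str) -> List[List[str]]:
    """Return the git commands needed to refresh the mirror (runs nothing)."""
    if Path(workdir).exists():
        # An existing mirror clone already tracks every ref; just refresh it.
        sync = ["git", "-C", workdir, "fetch", "--prune", "origin"]
    else:
        sync = ["git", "clone", "--mirror", source_url, workdir]
    push = ["git", "-C", workdir, "push", "--mirror", mirror_url]
    return [sync, push]


def run_mirror(source_url: str, mirror_url: str, workdir: str) -> None:
    """Execute the commands, stopping at the first failure."""
    for cmd in mirror_commands(source_url, mirror_url, workdir):
        subprocess.run(cmd, check=True)


# Placeholder URLs; substitute your own repositories.
for cmd in mirror_commands("https://gitlab.example/me/project.git",
                           "https://github.example/me/project.git",
                           "/tmp/project-mirror.git"):
    print(" ".join(cmd))
```

Note that `--mirror` only carries git refs (branches, tags, notes); anything stored outside the repository is not copied.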
I'm surprised nobody has tried to throw together a commercial alternative to GitHub. 50% of it is available as FOSS, the other 50% you can vibecode in a month (you can vibecode reliably, Microsoft/Google just suck at it). AFAICT, the reason we all keep using GitHub is that it has a million features and isn't as ugly, difficult and slow as GitLab. (Sorry GitLab, I love your handbook, hate your UX.)
I've been using "slopocalypse". People already know AI is responsible, but slop existed before — e.g. conventionally generated SEO spam. It's just... so much worse now.
> GitHub has recently seen more outages, in part because its central data center in Virginia is indeed resource-constrained and running into scaling issues. AI agents are part of the problem here. But it’s our understanding that some GitHub employees are concerned about this migration because GitHub’s MySQL clusters, which form the backbone of the service and run on bare metal servers, won’t easily make the move to Azure and lead to even more outages going forward.
Age-old lesson: change the tires on the moving vehicle that is your business when it's a Geo Metro, not when it's a freight train.
I'm sure the people with the purse strings didn't care, though, and just wanted to funnel the GH userbase into Azure until the wheels fell off, then write off the BU. Bought for $7.5B, it used to make $250M, but now makes $2B, so they could offload it and make a profit. I wonder who'll buy it. Prob Google, Amazon, IBM, Oracle, or a hedge fund. They could choose not to sell it, but it'll end up a writeoff if the userbase jumps ship.
I assume this is all of the pains of going from "GHA is sorta kinda on Azure", which was a bad state, to "GHA is going full Azure", which is a painful state to get to but presumably simplifies things.
> Any massive infra migration is going to cause issues.
What? No, no it's not. The entire discipline of Infrastructure and Systems engineering is dedicated to doing these sorts of things. There are well-worn paths to making stable changes. I've done a dozen massive infrastructure migrations, some at companies bigger than GitHub, and I've never once come close to this sort of instability.
This is a botched infrastructure migration, onto a frankly inferior platform, not something that just happens to everyone.
I remember back in the early Windows XP era when things got so bad that Microsoft basically had to make a hard pivot towards security and reliability.
I think they may need to do that once again. Almost every product of theirs feels like a dumpster fire. GitHub is down constantly, Windows 11 is a nightmare and instead of patching things they're adding stupid features nobody asked for. I think they need to stop and really look closely at what they're prioritizing.
I remember. My GitHub user ID is #5907, account created 2008-04-08T20:27:36Z. I think it is inevitable that all good things come to an end, but it's still a bummer to see.
I've been sitting here waiting for a critical deploy to happen via GitHub Actions (I know, our fault, we should have left ages ago). My patience for this bullshit is gone, I'm going to be pushing very hard to get us off of GitHub entirely except for public code mirrors going forward.
Edit: oh look, their site says all good, but I still have jobs stuck. What a pile of garbage.
So am I the only one thinking that maybe GitHub is succumbing to the weight of AI slop that's coming in from all the vibecoding, clawbots, and other AI workflows?
Github CEO must be on HN, right? If so, any comments?
They have not even bothered to implement Entra login, when they have had their competitors' logins for years. Do they even know what their product is? Or are you just a middleman for slop?
GitHub goes down at least once a week as I said before. [0] thanks to Copilot, Tay.ai and Zoe chatbots wrecking the platform instead of humans maintaining it.
If there was a prediction market for when GitHub experiences an outage every week, then you would make a lot of money.
>GitHub goes down at least once a week as I said before. [0] thanks to Copilot, Tay.ai and Zoe chatbots wrecking the platform instead of humans.
there are tens of thousands of stupid scripts hosted on github itself that have scheduled programmatic pushes or pulls to repos via cron jobs with millions and millions of users -- yeah LLMs accelerate the fire but let's not pretend that GH was some bastion of real-user-dom somehow at some point.
Sorry, I realise this comment isn't up to HN's usual standards for thoughtfulness and it is perhaps a bit inflammatory but... look, I'd bet the majority of us on this site rely on GitHub and I can't be the only one becoming incredibly frustrated with its recent unreliability[0]?
(And, yes, I did enough basic data analysis to confirm that it IS indeed getting worse versus a year, two years, and three years ago, and is particularly bad since the start of this year.)
[0] EDIT: clearly not from looking at the rest of the comments in this discussion.
@KaiserPro has pasted the link to someone else's heatmap, which is really good. Mine was just an Excel spreadsheet with a graph that I'd intended to write a blog about but then got demotivated on because I was too busy with other things and I saw that heatmap as well. Maybe I will do a proper write up next time GitHub has an outage and I'm blocked by it.
Why don't companies with chronic outages mimic their stack from top to bottom (i.e. starting with a new domain), then before making a change, make the change on the duplicate stack and blast it with mock requests.
Might catch 90% of problems before they make it into the real stack?
E.g. every step of GitHub's migration to Azure could be mimicked on the duplicate stack before it's implemented on the primary stack. Is this just considered too much work? (I doubt cost would be the issue, because even if it costs millions, it would pay for itself in reduced reputational damage from outages).
EDIT: downvotes - why? - I think this is a good idea (I'd do it for my sites if outages were an issue).
> EDIT: downvotes - why? - I think this is a good idea (I'd do it for my sites if outages were an issue).
Because that's a monumental amount of work, and extraordinarily difficult to retrofit into a system that wasn't initially designed that way. Not to mention the unstated requirement of mirroring traffic to actually exercise that system (given the tendency of bugs to not show up until something actually uses the system).
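For what it's worth, the mirroring idea itself is simple to sketch; the hard part described above is everything around it. A toy version, assuming requests and responses are plain strings and both stacks are just callables (all names here are hypothetical): serve from the primary, replay the same request against the duplicate stack, and record any divergence without ever letting the duplicate affect the user.

```python
from typing import Callable, List, Tuple

Handler = Callable[[str], str]
Mismatch = Tuple[str, str, str]  # (request, primary response, shadow result)


def make_mirroring_handler(primary: Handler, shadow: Handler,
                           mismatches: List[Mismatch]) -> Handler:
    """Wrap `primary` so each request is also replayed against `shadow`."""

    def handle(request: str) -> str:
        response = primary(request)  # the user only ever sees this
        try:
            shadow_response = shadow(request)
            if shadow_response != response:
                mismatches.append((request, response, shadow_response))
        except Exception as exc:  # a broken shadow must never hurt users
            mismatches.append((request, response, f"shadow error: {exc!r}"))
        return response

    return handle


# Toy stacks: the shadow has a deliberate bug on one input.
found: List[Mismatch] = []
handler = make_mirroring_handler(
    str.upper, lambda r: "oops" if r == "bad" else r.upper(), found)
handler("ok")
handler("bad")
print(found)
```

A real version would replay asynchronously and would have to deal with writes, state, and side effects, which is where the "monumental amount of work" comes in.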
[1]: https://thenewstack.io/github-will-prioritize-migrating-to-a...
I wonder if the extended downtime is just due to the on-call engineers waiting for their azure auth tokens to refresh within azure's own damn network.
IMO it's much better now.
https://trends.google.com/trends/explore?date=all&geo=GB&q=s...
Would you like help?
- Get help with developing the software
- Just develop the software without help
[ ] Don't show me this tip again"
FTFY. (I've read AWS word it like that)
https://foja.applycreatures.com
Edit: it has a wonderful API so I posted the link it may tempt some to ditch MS/Azure hub.
Currently consulting somewhere with 30 services per engineer. I cannot convince them this is hell. Maybe that makes it my personal hell.
How is such service spam different from the Unix "small functions that do one thing only" culture?
Why is it usually/historically seen as nice in the Unix case, while in the web case it makes stuff worse?
In that every night you're playing murder mystery, and it's never fun.
99.99
99.90
99.00
90.00
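For scale, here is what each of those availability figures allows in downtime per year. This is plain arithmetic, assuming a 365-day year and treating the numbers above as availability percentages.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; ignores leap years


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Minutes of downtime per year permitted at a given availability."""
    return (100.0 - availability_percent) / 100.0 * MINUTES_PER_YEAR


for pct in (99.99, 99.90, 99.00, 90.00):
    minutes = downtime_minutes_per_year(pct)
    print(f"{pct:6.2f}% -> {minutes:9.1f} min/year (~{minutes / 60 / 24:.2f} days)")
```

Each nine dropped multiplies the allowed downtime by ten: roughly 53 minutes, 8.8 hours, 3.65 days, and 36.5 days.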
https://gitlab.com/gabriel.chamon/ci-components/-/tree/main/...
That helps with Git; not so much with issues etc.
At any rate, it seems like GitHub is back up now, so we'll see how long that lasts.
https://www.forbes.com/sites/bernardmarr/2025/07/08/microsof...
To explain this one-word comment for those unfamiliar, see previously:
GitHub will prioritize migrating to Azure over feature development (5 months ago) https://news.ycombinator.com/item?id=45517173
In particular:
Artificial intelligence, Azure integration, many other things.
https://about.gitea.com/
I can’t be specific but we are constantly complaining.
I'm so sick of this.
I've been considering it for a while, but I'm definitely now pitching a move away from GitHub at our organization.
[0] https://news.ycombinator.com/item?id=47487881
> And, yes, I did enough basic data analysis to confirm
Perhaps you'd consider showing us that analysis? That sounds like it would make a pretty substantive, thoughtful comment.
Gaze upon the tapestry in which GitHub paints its failure with a thin copper-red thread:
https://www.githubstatus.com/