Every SWE knows DOGE can't rewrite Social Security in a few months
DOGE plans to port the entirety of Social Security from COBOL to Java, previously estimated at 5 years, within a few months. I explain why that's not happening.
Per Wired, DOGE plans to port the Social Security Administration’s codebase from COBOL to Java. They plan to port the entire 60-million-line codebase in a matter of months.
DOGE Plans to Rebuild SSA Code Base in Months, Risking Benefits and System Collapse
The so-called Department of Government Efficiency (DOGE) is starting to put together a team to migrate the Social Security Administration’s (SSA) computer systems entirely off one of its oldest programming languages in a matter of months, potentially putting the integrity of the system—and the benefits on which tens of millions of Americans rely—at risk.
The project is being organized by Elon Musk lieutenant Steve Davis, multiple sources who were not given permission to talk to the media tell WIRED, and aims to migrate all SSA systems off COBOL, one of the first common business-oriented programming languages, and onto a more modern replacement like Java within a tight timeframe of a few months.
My initial reaction was that maybe they could generate a compiling program within a few months, but it would take years to have confidence that it is doing the right thing. Thankfully, an expert at the SSA confirmed my gut reaction.
In order to migrate all COBOL code into a more modern language within a few months, DOGE would likely need to employ some form of generative artificial intelligence to help translate the millions of lines of code, sources tell WIRED. “DOGE thinks if they can say they got rid of all the COBOL in months, then their way is the right way, and we all just suck for not breaking shit,” says the SSA technologist.
DOGE would also need to develop tests to ensure the new system’s outputs match the previous one. It would be difficult to resolve all of the possible edge cases over the course of several years, let alone months, adds the SSA technologist.
It’s not hard to imagine that they will somehow produce a codebase that compiles in a few months, given enough LLM hours and SWE hours. There are examples like Airbnb, where teams accelerated long-term migrations by an order of magnitude. However, Airbnb had two advantages over the SSA port:
The conversion was done by the same organization that wrote the tests.
The port was between two libraries in the same language.
Violating these two constraints has a host of interesting consequences that make this unlikely to succeed. And even if it did succeed, it would take years before you believed it worked. Five years ago, I wrote on my personal blog about the most likely outcome: they will crash across the finish line and declare success despite all the carnage.
Problem one: the conversion itself
IBM has a tool called Watsonx that automatically converts programs from COBOL to Java. Read about how IBM’s tool works. It goes file-by-file and generates a Java file that matches the provided COBOL. At this point, you are supposed to compare the COBOL and the generated Java and fix any mistakes in the Java. This means that you need enough experience with both languages to decide that the output does the same thing as the input. That’s a bit alarming given that DOGE confused run-of-the-mill COBOL date sentinels with 150-year-olds getting Social Security benefits.
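As a concrete (and entirely hypothetical) illustration of the kind of judgment that reviewer needs: many COBOL systems encode an "unknown" date as a sentinel value, and a faithful Java translation has to surface that sentinel as an explicitly missing value rather than treat it as a real date. A minimal sketch, assuming an eight-digit numeric date field where all zeros means "unrecorded":

```java
import java.time.LocalDate;
import java.util.Optional;

public class SentinelDates {
    // Assumed sentinel convention: an eight-digit numeric date of
    // 00000000 means "unknown". Real systems vary (all nines, fixed
    // epoch dates, etc.) -- this value is for illustration only.
    static final int UNKNOWN = 0;

    // Convert a COBOL-style YYYYMMDD integer to an Optional date.
    static Optional<LocalDate> fromCobolDate(int yyyymmdd) {
        if (yyyymmdd == UNKNOWN) {
            // Not "born in year 0" -- simply never recorded.
            return Optional.empty();
        }
        int year = yyyymmdd / 10000;
        int month = (yyyymmdd / 100) % 100;
        int day = yyyymmdd % 100;
        return Optional.of(LocalDate.of(year, month, day));
    }

    public static void main(String[] args) {
        System.out.println(fromCobolDate(19570304)); // a real date
        System.out.println(fromCobolDate(0));        // an unknown date
    }
}
```

A transpiler that mechanically maps the sentinel into a `LocalDate` instead would produce code that compiles and looks plausible, yet reports impossible ages downstream.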
This is further exacerbated by the fact that the DOGE engineers are not the engineers that produced the code. Seemingly small differences can easily introduce new bugs in previously-working code.
Furthermore, can you imagine actually spending 12 hours a day staring at before/after versions of Java and COBOL? What are the odds that they “full send” some of the conversions without really comparing the before and after?
Problem two: conversion quality
I read forum posts from COBOL programmers. They say the tools do a 90% job, because there are differences between mainframe and non-mainframe environments that automation cannot bridge. Let’s assume things have gotten better and 90% is low; call it 95% to 98%. The last 2-5% suck. It’s not just that you need to port them by hand. It’s that they cannot be ported at all and require rearchitecting your application.
What kinds of things? I’ll let Hacker News user kochbeck give examples:
From the stock OS, can you request a write that guarantees lockstep execution across multiple cores and cross-checks the result? No. Can you request multisite replication of the action and verified synchronous on-processor execution (not just disk replication) at both sites such that your active-active multisite instance is always in sync? No. Can you assume that anything written will also stream to tape / cold storage for an indelible audit record? No. Can you request additional resources from the hypervisor that cost more money from the application layer and signal the operator for expense approval? No.
These are built-in capabilities that require engineering effort to duplicate in Java.
Problem three: verification
The core problem: How do you know that your 60,000,000 lines of COBOL perform exactly the same work as the x0,000,000 lines of Java that you generated?
Put another way, how do you know the “after” system works the same as the “before” system? There are repeatable and practical approaches to this. You could break the conversion down into stages, either separated by application or following the way data flows through the system. For each stage, you write the replacement, run them side-by-side, and investigate any differences until you are satisfied that they are equivalent. Then you cut to the new system and begin the next stage.
While you do that, you write and execute unit tests — preferably data-driven unit tests that execute against both environments — so that you can ensure that both systems behave identically.
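A sketch of what that side-by-side verification might look like. Everything here is a hypothetical stand-in: `BenefitSystem`, `monthlyBenefitCents`, and the stub implementations represent, respectively, a common interface, the legacy COBOL batch job, and the candidate Java replacement.

```java
import java.util.ArrayList;
import java.util.List;

public class ShadowHarness {
    // Hypothetical common interface over both systems. In reality the
    // "legacy" side would shell out to the COBOL job and the "candidate"
    // side would call the new Java service.
    interface BenefitSystem {
        long monthlyBenefitCents(String recordId);
    }

    // Run both systems over the same records and collect every record
    // where the outputs disagree, for human investigation.
    static List<String> findMismatches(BenefitSystem legacy,
                                       BenefitSystem candidate,
                                       List<String> recordIds) {
        List<String> mismatches = new ArrayList<>();
        for (String id : recordIds) {
            long before = legacy.monthlyBenefitCents(id);
            long after = candidate.monthlyBenefitCents(id);
            if (before != after) {
                mismatches.add(id + ": legacy=" + before + " candidate=" + after);
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        // Deterministic stand-ins: the candidate deliberately disagrees
        // on record "B" so the harness has something to report.
        BenefitSystem legacy = id -> id.length();
        BenefitSystem candidate = id -> id.length() + ("B".equals(id) ? 1 : 0);
        System.out.println(findMismatches(legacy, candidate, List.of("A", "B")));
    }
}
```

The hard part is not the harness; it is deciding, for every mismatch, whether the COBOL or the Java is wrong, and doing that tens of millions of times.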
You can’t do these in a few months. Especially not with a program like Social Security that has things like year-end summaries, monthly payments, etc. It would simply take too long to wait for these to happen to verify their behavior. If you really want to hit a tight deadline, it makes more sense to “lift-and-shift” onto the new system entirely and then fix problems as they happen.
Problem four: COBOL and Java use different computing paradigms and have different abstractions
Have you ever used code that was transpiled from one language to another? Then you know the output is terrible. LLMs can certainly fix some of the most consistent problems like “the machine-generated names look hideous,” but ultimately it will not be the code you would have written as a Java programmer.
Plus throw in the fun facts like “COBOL has fixed-point numbers” and “Java has exceptions” and otherwise-straightforward conversions are suddenly more complicated.
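To make the fixed-point problem concrete: a careless transpile of a COBOL money field (think `PIC 9(7)V99`, exact decimal) to Java’s `double` silently introduces binary rounding error, while the faithful translation uses `BigDecimal` with an explicit scale and rounding mode. The figures below are made up for illustration, not real benefit math:

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

public class FixedPoint {
    public static void main(String[] args) {
        // Naive translation: binary floating point cannot represent
        // 0.10 exactly, so summing a thousand dimes drifts off $100.00.
        double naive = 0.0;
        for (int i = 0; i < 1000; i++) naive += 0.10;
        System.out.println(naive); // not exactly 100.0

        // Faithful translation: decimal arithmetic with a fixed scale
        // of 2, matching the COBOL field's behavior.
        BigDecimal exact = BigDecimal.ZERO;
        BigDecimal dime = new BigDecimal("0.10");
        for (int i = 0; i < 1000; i++) exact = exact.add(dime);
        exact = exact.setScale(2, RoundingMode.HALF_UP);
        System.out.println(exact); // 100.00
    }
}
```

A penny of drift per calculation, multiplied across tens of millions of monthly payments, is exactly the kind of bug that is invisible at code-review time and very visible at audit time.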
Do you need an example? Look at an example COBOL program. Literally procedural programming. Now imagine this line-for-line being converted into a Java program. The most obvious way to convert this is to cram the entire implementation into static functions, which is not how Java is written.
Imagine parachuting a Java engineer down into this codebase. They’d take off their chute, look around, and say “What on earth? Why is everything done this way? None of this is Java.”
Problem five: operations
The current system is COBOL on mainframe hardware. DOGE is considering Java for the new system. Every single difference between these systems could be relevant to whether it functions correctly:
COBOL compiles to machine code and Java compiles to JVM bytecode that is just-in-time compiled at runtime. These have different performance characteristics. Compiled languages typically start and run faster, but the JVM can use runtime information to make better optimizations than a static compiler can. They could easily need to rearchitect parts of the application that depended on a compiled language’s characteristics.
If they do Java on the mainframe, then they will need to learn how to productionize applications on something like z/OS, since most of their experience will certainly be on Linux. Is it POSIX? Yes. Is it what they’re used to? No. Will there be a learning curve? Almost definitely.
If they run Java on more modern hardware1, they may need to split the applications across multiple machines. If you have data that must be synchronized across transactions being simultaneously executed on hundreds of cores — which is easy to imagine for a nationwide application like Social Security — suddenly this turns into a distributed systems problem. This is much harder than “we simply ran our LLM on the COBOL to produce equivalent Java.”
There are other more minor differences that still matter. The JVM is garbage collected, while COBOL requires manual memory management. The garbage collector introduces extra CPU and memory overhead. Additionally, my experience with larger JVM applications is that the default settings are always wrong. These introduce extra trial-and-error situations that can jeopardize a tight launch timeline.
COBOL experts could certainly explain more differences. But it doesn’t matter; a problem in any single bullet point can blow up the timeline. Does the old system spawn tons of processes to handle background tasks? Well, you’re going to eat JVM startup time over and over, and now you need to make them work in-process without taking the whole thing down.
Problem six: Lag time on bug reports
If there are any problems, they won’t appear immediately. Imagine someone’s check doesn’t arrive. They might not even say anything for a month. But then they don’t get a check the next month, and they finally contact their local Social Security Administration branch. SSA triages and says, “hm, you should have gotten those checks.” Then somebody must debug the Java codebase and determine what happened to these checks. It could be months before it’s fixed.
I know that the “Starving” comic is the most likely outcome, at least at first. But someone will need to investigate it eventually if the Social Security program is still running at all; ignoring it only makes it harder to diagnose.
Appendix: “Joel on Software” history lesson
I’d like to highlight some prior art.
In “Things You Should Never Do, Part I”, Joel Spolsky argues that rewrites are born of hubris and will likely fail for many reasons. This was written before some of the youngest DOGE engineers were born, and it still holds up today2.
I’m not going to reproduce the full blog post here. Just go read it. But he makes the following points:
People want to rewrite because “It’s harder to read code than to write it.” Arguably true for a COBOL codebase.
There is no reason to believe you would do a better job than you did the first time. You don’t have more experience than the team that wrote the first version did when they started. You will make most of the same mistakes and introduce your own mistakes.
Rewrites throw away all of your bug fixes.
All new features are put on pause until the rewrite ships. Interesting to think that this can’t apply to Social Security, which might be required by law to implement changes. So they may additionally need to implement legally mandated changes in both the old and new systems mid-rewrite.
It wastes an outlandish amount of money.
1. Most of the tech industry builds distributed networks out of smaller commodity hardware.

2. That being said, you should rewrite everything in Rust.
I am a 65-year-old COBOL programmer. I also know Java. As somebody who never gambles, I would be willing to make an exception. There is NO way that a massive COBOL system can be converted to Java in a few months. Successfully, that is. I'll take that bet. Let me toss out a better estimate: 8 years.
Once your tool generates the shitty Java code - and it will be profoundly shitty - the project managers will get bonus boners but will fail to realize the heavy lifting is still all in front of them. Testing. Testing. Testing. Develop test cases based on the specifications, assuming the specs are in good shape.
Take a look at the payroll system that was rewritten by IBM for the government of Canada. It did not go well. https://en.wikipedia.org/wiki/Phoenix_pay_system
Small nitpick. Garbage collection. COBOL does have a garbage collector of sorts but nobody calls it that. Memory is allocated automatically by the COBOL runtime when a program starts and typically discarded automatically when the program ends. Like Java programmers, COBOL programmers do not have to worry much about garbage collection. Of course there are always exceptions but I won't plunge into mad detail.
Here is my free advice. Don't touch the old system.
Teach your programmers COBOL. It is stinky but easy to learn. Give them danger pay. The old, unsexy, stinky system has benefits that few people appreciate. We all know what they are. I won't rehash here.
Now form a small team to figure out how to gradually replace the monolith, bit by bit, piece by piece. Rearchitect the new system properly. Before you write a line of code.
I enjoyed your post. Well done.