Whatever You Do, Never Stop The World to Refactor A Project

Joel Spolsky called rewriting code from scratch “the single worst strategic mistake that any software company can make”. Any experienced engineer has felt the dread of realizing that either they messed up or the product has changed to the point where the original design will no longer work. There are two approaches that most teams use to resolve this:

Stop The World
Stop everything, refactor the code, and bring everything back online all at once with cleaner, better-running code. The goal is to get the best possible system online faster by not having to build and support the system at the same time.
Incremental Refactor
Keep the project progressing, selectively refactor segments of it, and bring them online one at a time. The goal is to keep people working at the expense of a slower implementation.
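As a minimal sketch of what the incremental approach can look like in code: the old and new implementations live side by side, and a small routing function decides which one each part of the project uses. Every name here (`legacy_total`, `refactored_total`, `MIGRATED_CALLERS`) and the pricing logic itself are hypothetical, invented purely for illustration.

```python
from decimal import Decimal

# Hypothetical example: all names and the business rule are assumptions.
# Segments of the project that have already been moved to the new code.
MIGRATED_CALLERS = {"checkout"}


def legacy_total(prices):
    """Old implementation: float arithmetic, kept running during the refactor."""
    total = 0.0
    for p in prices:
        total += float(p)
    return round(total, 2)


def refactored_total(prices):
    """New implementation: exact decimal arithmetic."""
    return float(sum((Decimal(str(p)) for p in prices), Decimal("0")))


def order_total(prices, caller):
    """Route each caller to the old or new code, one segment at a time."""
    if caller in MIGRATED_CALLERS:
        return refactored_total(prices)
    return legacy_total(prices)
```

Moving another segment over is then a one-line change (adding its name to `MIGRATED_CALLERS`), and rolling it back is just as small, so nobody has to stop working while the refactor proceeds.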

I argue that stopping the world is almost certainly going to lead to failure.

Famous Stop The World Failures

Microsoft WinFS in Windows Vista
Microsoft attempted to bring WinFS (Windows Future Storage) to market as a new relational storage system built on top of its aging NTFS file system. This would allow relations between files to be fetched easily and was based on the SQL Server code base. The result? Windows Vista shipped very late, without WinFS, and its file performance was reportedly 30% slower.
Netscape 4.0 to 6.0
Netscape took 3 years to rewrite version 4.0 from the ground up, skipping v5 altogether. In the meantime, its browser share continued to plummet, which led to the death of Netscape as a browser. Mozilla was the beneficiary of this rewrite and didn’t get a release out until 2002, well after the death of Netscape.
Source: Wikipedia

Stop The World Most Often Leads to Project Failure

There are several reasons why stopping the world is more likely to cause a project to fail than an incremental approach.

It devalues the work of non-engineers
Stopping the world to refactor a system puts the entire project on hold while engineers sort out how to support all the functions that are needed. This is a crucial mistake, as it prevents everyone else from making progress on the project during the downtime. If work stops, it’s a day-for-day schedule slip, which in my opinion is not acceptable.
It will take longer than you think
Just like any construction project, it will always take longer than you think. Code elsewhere in the project that was built around assumptions about how the old system worked will likely break. Be prepared to write a lot more code than just the section being refactored.
It will probably end up with the original sins of the original code
The refactor will probably swap one set of problems for a new set of problems. I’ve personally made this mistake, thinking that things would be better in the new system. They almost never are. On top of that, the cost of switching over makes it even more difficult. Imagine running an online service and having to deprecate the old system and move everyone over to the new one. It’s already difficult to rewrite the thing; now migration comes into play as well. Ugh.

Cases Where Stop The World Could Work

There are a few cases where a Stop The World refactor could work, in my opinion, but they are very narrow.

Criteria To Use Stop The World Refactor

Short, Scheduled Interruption To the Project (usually less than a week)
Clear scope of the refactor to be completed
Clear and demonstrable success criteria (unit tests, integration tests, etc.)
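
One way to make the third criterion concrete is a characterization test: pin down what the old code does today and require the refactored code to match it on the same inputs. This is only a sketch under assumptions; `slugify_old`, `slugify_new`, and the sample inputs in `CASES` are hypothetical stand-ins for whatever is actually being refactored.

```python
import re

# Hypothetical example: both functions and the sample inputs are assumptions.


def slugify_old(title):
    """Legacy code: builds the slug one character at a time."""
    out = ""
    for ch in title.strip().lower():
        if ch in "abcdefghijklmnopqrstuvwxyz0123456789":
            out += ch
        elif not out.endswith("-"):
            out += "-"
    return out.strip("-")


def slugify_new(title):
    """Refactored code: the same behavior expressed as one regular expression."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


# Inputs chosen to cover ordinary titles plus a few edge cases.
CASES = ["Hello, World!", "  Stop The World  ", "v4.0 to 6.0", "", "---"]


def find_mismatches(cases):
    """Return (input, old_result, new_result) for every input where they differ."""
    return [
        (c, slugify_old(c), slugify_new(c))
        for c in cases
        if slugify_old(c) != slugify_new(c)
    ]


if __name__ == "__main__":
    mismatches = find_mismatches(CASES)
    assert mismatches == [], mismatches
```

An empty mismatch list is the kind of demonstrable success the criterion asks for; any entry in it shows exactly which input the refactor changed.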

Even if you meet all three of these criteria, you should just do an incremental refactor anyway.