This is a blog post by Pankhuri and Naveen Vardhi, of the BloomReach technical staff.
On August 9, 1995, Silicon Valley darling Netscape went public at a market value of $2.9 billion. Success is an understatement. It was a blockbuster and Netscape all but cornered the browser market. A little more than a decade later, Netscape was virtually irrelevant, left behind by Microsoft’s ascendance. Among the factors in Netscape’s downfall was a miscalculation that is more common than many realize: The company chose the cumbersome process of rewriting its code from scratch, rather than improving upon what it had already created.
In the blog post “Things You Should Never Do”, Joel Spolsky calls rewriting code from scratch the single worst strategic mistake that any software company can make. However, what if you are stuck with archaic, monolithic and unmaintainable software that has reached its limits and is keeping you from making rapid progress?
There is a way. Let’s take a look at how BloomReach upgraded its software without taking on the headaches a total rewrite can bring.
Upgrading BloomReach Commerce Search API server
The Commerce Search REST API server is a multi-tenant system with a single API layer. It serves requests from multiple features, such as personalized search, product recommendations, autosuggest and the merchandising dashboard.
This API server uses Nginx and Gunicorn as the web server, with the Django framework for web application development. To build an API response, it connects to various back-end services, such as SolrCloud, Personalization, Cassandra and Redis.
Multiple teams contributed to the codebase, and technical debt accumulated over time. The result was a complex piece of software that was difficult to extend and maintain. Changes in one component threatened to break other functionality. Because of this, we decided to upgrade our API server.
Original REST API server (v1)
There was a single endpoint in our API server for serving all the features, and each feature had its own flow. For example, if a search request came to our API server, it was directed to a search handler, written specifically for processing search requests, which then served the response. Other features had a similar flow.
Issues with API server v1
The Commerce Search API was built about four years ago. Gradually, many features were added to it and it had to connect to a lot of services. Four or five different teams contributed to its codebase, making it complex. The server also carried technical debt accumulated over the years, and it slowly became difficult to maintain. Some of its characteristics follow.
- Complex system: It developed into a complex system where all services were accessed from a single point. Everything was tightly coupled, and a change in one service always risked side effects in others.
- Low code coverage: With multiple teams contributing, the scope of the unit tests became limited and code coverage gradually declined over time. Moreover, the tightly coupled code structure made it difficult for developers to write comprehensive tests.
- High learning curve: If a new developer wanted to fix a trivial bug, it required understanding the whole system and complete request flow.
- Not extensible: It had no plug-n-play support, making it difficult to extend existing code for new services.
- Code duplication: Since multiple teams contributed to the codebase, each team ended up writing its own code for similar functionality.
- Unnecessary processing in main thread: After serving the request, some post-processing was done in the main thread. It included posting metrics to monitoring tools like Graphite. Ideally, any non-core functionality should run in a separate thread.
- Poor profiling: It was difficult to add profiling for individual components, as code pieces for different services were coupled together.
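To illustrate the point about post-processing, here is a minimal sketch of moving metrics posting off the request thread. It is not our production code: `BackgroundMetrics` and the `send` callable are hypothetical names, and `send` stands in for whatever client actually talks to a tool like Graphite.

```python
import queue
import threading


class BackgroundMetrics:
    """Hypothetical helper: queues metrics and sends them from a worker thread."""

    def __init__(self, send):
        # `send` is any callable taking (name, value); assumed, not a real client API.
        self._send = send
        self._queue = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def record(self, name, value):
        # Called from the request thread; only enqueues, so it returns immediately.
        self._queue.put((name, value))

    def _run(self):
        while True:
            item = self._queue.get()
            try:
                if item is None:  # sentinel: stop the worker
                    return
                self._send(*item)
            except Exception:
                pass  # a metrics failure must never affect serving
            finally:
                self._queue.task_done()

    def close(self):
        # Flush everything queued before the sentinel, then stop the worker.
        self._queue.put(None)
        self._worker.join()


sent = []
metrics = BackgroundMetrics(lambda name, value: sent.append((name, value)))
metrics.record("search.latency_ms", 42)
metrics.close()
```

The request thread pays only for a queue insert. Whether a thread, a separate process or an external agent is the right home for this work depends on the deployment.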
New version of REST API server (v2)
Given below are the guiding principles for our new and shiny REST API server:
- High code coverage.
- Future proof and extensible service-oriented architecture.
- Easy to learn and modify.
- Extensive testing and profiling.
We came up with the following high-level design, which has two major components, one for request processing and the other for response processing. These two components are independent of each other and follow the Single Responsibility Principle. All the features follow this request flow.
CommerceSearchRequest (CSRequest) is a wrapper object around an http request. The request processing component populates it with the configs required for response processing. CSRequest holds the state required for response creation without re-visiting configs.
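A minimal sketch of this idea follows, assuming a plain dictionary of query parameters in place of a real HTTP request. Apart from `CSRequest`, every name here (`merchant_config`, `feature_config`, `build_cs_request`, the `request_type` parameter) is made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class CSRequest:
    """Wraps the raw request parameters plus the configs resolved for them."""
    params: Dict[str, str]
    configs: Dict[str, Any] = field(default_factory=dict)


# A request component is any callable that adds configs to the CSRequest.
RequestComponent = Callable[[CSRequest], None]


def merchant_config(request: CSRequest) -> None:
    # Hypothetical config fetcher; a real one would read from a config store.
    merchant = request.params["merchant"]
    request.configs["merchant"] = {"name": merchant, "rows": 10}


def feature_config(request: CSRequest) -> None:
    # Hypothetical: derive the feature type from an assumed `request_type` parameter.
    request.configs["feature"] = request.params.get("request_type", "search")


def build_cs_request(params: Dict[str, str],
                     components: List[RequestComponent]) -> CSRequest:
    # Copy the parameters so components cannot mutate the caller's dict.
    request = CSRequest(params=dict(params))
    for component in components:
        component(request)
    return request


cs_request = build_cs_request(
    {"merchant": "acme", "q": "red shoes"},
    [merchant_config, feature_config],
)
```

Once `build_cs_request` returns, response processing can read everything it needs from `cs_request.configs` without going back to the config sources.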
Response builders (handlers) create an initial response. Response components modify that response to create the final response.
Besides these two major components, we have config fetchers, which are used by request builders and components to create a CSRequest, and data fetchers, which are used by response builders and components to create a response.
Why we preferred refactoring over rewriting
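The response side can be sketched in the same spirit. This is an assumption-laden illustration, not our actual code: the `CSRequest` class below is a minimal stand-in, and `fake_search_fetcher`, `search_builder`, `limit_rows` and `add_count` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class CSRequest:
    # Minimal stand-in for the CSRequest wrapper described in the text.
    params: Dict[str, str]
    configs: Dict[str, Any] = field(default_factory=dict)


Response = Dict[str, Any]
DataFetcher = Callable[[CSRequest], List[Dict[str, Any]]]
ResponseComponent = Callable[[CSRequest, Response], Response]


def fake_search_fetcher(request: CSRequest) -> List[Dict[str, Any]]:
    # Hypothetical data fetcher; a real one would query a back end such as SolrCloud.
    return [{"pid": "p1", "title": "Red shoe"}, {"pid": "p2", "title": "Blue shoe"}]


def search_builder(request: CSRequest, fetch: DataFetcher) -> Response:
    # Response builder: creates the initial response from fetched data.
    return {"docs": fetch(request)}


def limit_rows(request: CSRequest, response: Response) -> Response:
    # Response component: trims the result list using a config on the CSRequest.
    rows = request.configs.get("rows", 10)
    return {**response, "docs": response["docs"][:rows]}


def add_count(request: CSRequest, response: Response) -> Response:
    # Response component: annotates the response with the number of documents.
    return {**response, "num_docs": len(response["docs"])}


def build_response(request: CSRequest, fetch: DataFetcher,
                   components: List[ResponseComponent]) -> Response:
    response = search_builder(request, fetch)
    for component in components:
        response = component(request, response)
    return response


request = CSRequest(params={"q": "shoe"}, configs={"rows": 1})
response = build_response(request, fake_search_fetcher, [limit_rows, add_count])
```

Because each component takes a response and returns a new one, a feature can pick only the components it needs, and each one can be unit tested with a fake fetcher.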
There are two options for upgrading software: either rewrite the whole code from scratch or incrementally refactor it. The scenario can be compared to waterfall vs agile models of software development. We decided to go with refactoring because:
- Reinventing the wheel: Although v1 is arduous to maintain, a lot of effort and thought went into developing its functionality, and it works as expected. Rewriting would require re-executing the entire software development cycle.
- Time: Rewriting from scratch would have been a multi-quarter project.
- Potential bugs: Not all functionality in the server is well documented. A rewrite risks missing such functionality, and the omission would only be discovered later as a bug.
- Migration: In refactoring, incremental changes can be deployed, whereas in the case of a rewrite, we cannot publish until all the features and functionalities have been developed.
- Scope of testing: When deploying incremental changes through refactoring, we can test and QA each change extensively and deploy the code with confidence. With a rewrite, all the changes are deployed at once, so there is a higher chance of missing a bug.
Problems faced in refactoring
Here are a few problems we faced while refactoring our code:
- Python (scripting vs. objects as first-class citizens): Our code was predominantly written in Python, and since multiple developers contributed, it lacked uniformity. Some people even used it as a scripting language, with large functions and hardly any use of data classes. Refactoring such code was cumbersome.
- Testing: We added new test cases during refactoring, but in the initial phases we first had to build new end-to-end testing methods, i.e. methods that test the entire request flow. For this, we replayed production logs against the two versions of the code. If there was no difference in the responses, we knew the refactoring had not changed behavior. We also took care of the multi-tenant aspect of our system by testing every feature type for each merchant.
- Backward compatibility: While changing the design of existing functionality, we always had to make sure that every incremental change was backward compatible, so that the new code could easily be turned on or off in production without reverting a release. The whole process taught us the importance of backward compatibility in software development.
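The log-replay check described in the testing point can be sketched as follows. This is a simplified illustration under stated assumptions: requests are plain dictionaries, handlers are plain functions, and `replay_and_diff`, `v1_search` and `v2_search` are hypothetical names. The toy v2 handler contains a deliberate regression for one merchant so that the diff has something to find.

```python
import json
from typing import Callable, Dict, Iterable, List

Handler = Callable[[Dict[str, str]], dict]


def replay_and_diff(logged_requests: Iterable[Dict[str, str]],
                    old_handler: Handler,
                    new_handler: Handler) -> List[Dict[str, str]]:
    """Return the logged requests for which the two versions disagree."""
    mismatches = []
    for params in logged_requests:
        # Serialize with sorted keys so that key order cannot cause false diffs.
        old = json.dumps(old_handler(params), sort_keys=True)
        new = json.dumps(new_handler(params), sort_keys=True)
        if old != new:
            mismatches.append(params)
    return mismatches


# Toy handlers standing in for the v1 and refactored code paths.
def v1_search(params):
    return {"q": params["q"], "merchant": params["merchant"], "rows": 10}


def v2_search(params):
    rows = 20 if params["merchant"] == "beta" else 10  # deliberate regression
    return {"merchant": params["merchant"], "q": params["q"], "rows": rows}


# Cover every merchant in the replayed log to respect multi-tenancy.
log = [
    {"merchant": "acme", "q": "shoes"},
    {"merchant": "beta", "q": "hats"},
]
mismatches = replay_and_diff(log, v1_search, v2_search)
```

Here `mismatches` contains only the request for the merchant whose behavior changed. A real harness would also need to ignore fields that legitimately differ between runs, such as timestamps.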
Refactoring achieved the intended benefits with lower risk
After refactoring, our system achieved the following benefits:
- Proper profiling: Commerce Search services are latency sensitive. Having proper profiling helps in identifying issues.
- Adding new services became easier: We wanted to build a new feature in our REST API server: autosuggest with product suggestions. With our simplified code, we were able to build this feature in a week. With the old design, it would have taken a month. Moreover, this is a time-sensitive service, with latency around one-tenth that of the search service. Since our new version was based on a plug-n-play model, it was easy to add only the components this feature needs and so minimize latency.
- Low learning curve: A new developer can fix a trivial bug with minimal understanding of the request flow.
- Legacy code: We got rid of complex and duplicate code.
- Coverage: We have added unit and integration tests with greater than 80 percent code coverage.
- SOLID: Our REST API server now follows the five SOLID principles.
- Segregating post-processing: Various post-processing tasks, which didn’t affect our serving but ran in our main thread, were moved to a background process.
- Latency reduction: With simplified code and by removing post-processing from the main thread, we reduced the latency of our service.
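To make the profiling and plug-n-play points more concrete, here is a minimal sketch of timing each component separately while running only the components a feature needs. `Profiler`, `run_components` and the two toy components are hypothetical and do not reflect our actual implementation.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class Profiler:
    """Hypothetical per-request profiler keyed by component name."""

    def __init__(self):
        self.timings_ms = defaultdict(float)

    @contextmanager
    def measure(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record elapsed time even if the component raises.
            self.timings_ms[name] += (time.perf_counter() - start) * 1000.0


def run_components(components, response, profiler):
    # Each entry is a (name, callable) pair; a feature passes in only what it needs.
    for name, component in components:
        with profiler.measure(name):
            response = component(response)
    return response


profiler = Profiler()
components = [
    ("dedupe", lambda r: {**r, "docs": sorted(set(r["docs"]))}),
    ("count", lambda r: {**r, "num_docs": len(r["docs"])}),
]
result = run_components(components, {"docs": ["b", "a", "b"]}, profiler)
```

After the call, `profiler.timings_ms` holds one entry per component, which is the kind of breakdown that was hard to get when code for different services was coupled together.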
In this post, we have shown how we refactored our complex REST API code into a much simpler version. It was done in an incremental fashion and took only four months. Like the Ship of Theseus, refactoring slowly changed the entire API server while retaining its soul. In this span, we deployed multiple weekly releases to production without any outage. Rewriting would have taken much longer, with a higher risk of production outages.