Safely migrating millions of API requests

  • By Andy Smith
  • 10 March 2020

In December we successfully flipped around half a billion monthly API requests from our Ruby on Rails application to some new Python 3 applications.

Now that the dust has settled, and we’re comfortable that all has gone well, I wanted to write up the details of the project, give a bit of a history of Engineering at Tessian, and share some lessons learned in the hope that others may benefit from mistakes we’ve made.

In the beginning, there was Rails

Long before Tessian became what it is today, most of our code base for our backend infrastructure was written in Ruby on Rails.

This was the right choice of technology at the time; it allowed us to produce a reliable product while iterating quickly. But as we grew it became apparent that being able to share production code with our data science team (who predominantly work in Python) would allow us to move much more quickly.

That was when we decided to build out some core backend functionality using Python 3. This would allow our backend code to lean heavily on various open source tooling for data science and machine learning.

That decision was made 3 years ago and today, in hindsight, it still looks like the right call. However, and you may be ahead of me already here, deciding to start using Python did not magically get rid of all the Ruby code we had already written.

That was the situation one of our Engineering teams found themselves pondering in August last year. 

Deciding to migrate

At Tessian our teams have themes to help define their place in the world. Themes are mini mission statements that ladder up to Tessian’s greater mission to secure the human layer.

One team’s theme is “Tessian’s stellar security reputation aids growth”. Since the team’s inception, they had been focusing on building security features in our Python code, where a lot of backend development and data processing takes place.

While debating the most important thing to work on next, we decided to dig into some data. This showed that 500,000 API requests an hour were being handled by our Ruby on Rails application server. Looking at that number, coupled with the fact that we had grown the Engineering team by 100% in the past year and hired 0 Ruby developers, it quickly became apparent that this part of our code base needed some attention.

The following factors ultimately contributed to our decision:

  • The proportion of Ruby experts in the company was shrinking.
  • Improvements to code linting and security frameworks were being added to Python and not Ruby.
  • Our Ruby app had not kept up with improvements we had made to monitoring and alerting.
  • Ruby was the original code base and contained some of the oldest and least well understood code in the company.
  • There were some tickets in our backlog around poor performance of some of the Ruby endpoints, meaning future development of them was likely.

So the decision was made: we would port the existing Ruby APIs to our Python code base, allowing them to make use of our latest frameworks and practices.

Path to migration

Given the high volume of traffic and the importance of the APIs, we wanted to keep risk to a minimum when porting them. The code also brought other challenges, such as poorly defined interfaces and many different client versions.

After a few whiteboard sessions, we settled on a phased approach that went as follows.

Phase 0 – Existing setup

The original setup – clients talking to our Ruby application.

Phase 1 – Transparent proxying

First we built a new Python application to transparently proxy traffic to our Ruby application. We slotted this into the API flow. Because it just proxied traffic, this was a relatively safe operation.
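To make the idea concrete, here is a minimal sketch of what a transparent proxy layer can look like. It is not our production code: the upstream address, the function names and the injected `send` callable (which stands in for a real HTTP client) are all assumptions made for illustration.

```python
from typing import Callable, Dict, Tuple

# Hop-by-hop headers describe a single connection, so a proxy should not
# pass them along. This list is illustrative, not exhaustive.
HOP_BY_HOP = {
    "connection", "keep-alive", "proxy-authenticate", "proxy-authorization",
    "te", "trailer", "transfer-encoding", "upgrade",
}

# (status code, headers, body)
Response = Tuple[int, Dict[str, str], bytes]
# (method, url, headers, body) -> Response; a stand-in for a real HTTP client.
Sender = Callable[[str, str, Dict[str, str], bytes], Response]

RUBY_BASE_URL = "http://ruby-app.internal"  # hypothetical upstream address


def forwardable(headers: Dict[str, str]) -> Dict[str, str]:
    """Drop hop-by-hop headers and keep everything else untouched."""
    return {k: v for k, v in headers.items() if k.lower() not in HOP_BY_HOP}


def proxy(method: str, path: str, headers: Dict[str, str], body: bytes,
          send: Sender) -> Response:
    """Forward the request to Ruby and hand its response straight back."""
    status, resp_headers, resp_body = send(
        method, RUBY_BASE_URL + path, forwardable(headers), body
    )
    return status, forwardable(resp_headers), resp_body
```

The point of keeping this layer so thin is that it contains no business logic to get wrong: the client still receives exactly what Ruby produced.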

Phase 2 – Response generation and comparison

The next step (and we did this for each API that we migrated) was to implement the API in Python and use live production traffic to compare two responses: “what we would have sent”, generated by the Python app (new_response), and “what we should have and did send”, generated by Ruby (old_response).

By comparing the responses, we could catch errors in the implementation based on live production data and fix them.

Note that this was not perfect; most of our APIs mutated state in a database in a way that was not idempotent. This meant that we never wanted both Ruby and Python to affect the database – it was either one or the other.
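The comparison step can be sketched roughly as follows. This is an illustration under assumptions, not our actual framework: responses are modelled as flat dictionaries, and `call_ruby`, `build_python_response` and `report` are hypothetical callables supplied by the caller. The client always receives old_response, and the Python side is expected to compute its answer without writing to the database.

```python
import logging
from typing import Any, Callable, Dict, Tuple

logger = logging.getLogger("migration.compare")

MISSING = "<missing>"  # placeholder for a field present on only one side


def diff_responses(old: Dict[str, Any],
                   new: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Return {field: (old_value, new_value)} for each top-level difference."""
    return {
        key: (old.get(key, MISSING), new.get(key, MISSING))
        for key in sorted(set(old) | set(new))
        if old.get(key, MISSING) != new.get(key, MISSING)
    }


def handle_in_shadow_mode(request: Any,
                          call_ruby: Callable[[Any], Dict[str, Any]],
                          build_python_response: Callable[[Any], Dict[str, Any]],
                          report: Callable[..., None] = logger.warning) -> Dict[str, Any]:
    """Serve Ruby's response; compute Python's answer only to compare it."""
    old_response = call_ruby(request)  # Ruby still owns the database write
    try:
        # Assumed to be side-effect free: it must not touch the database.
        new_response = build_python_response(request)
        mismatches = diff_responses(old_response, new_response)
        if mismatches:
            report("response mismatch: %s", mismatches)
    except Exception as exc:  # a bug in the new code must never reach the client
        report("python implementation failed: %r", exc)
    return old_response
```

Swallowing exceptions from the new code path is deliberate here: during this phase the Python implementation is an observer, so its failures should be recorded and fixed, never served.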

Phase 3 – Switchover

Once we had confidence that the response being returned was correct, it was time to stop using Ruby to affect the database and start using Python.

Note that because we did not want conflicts between Ruby and Python both altering database state, as we switched over, we stopped calling Ruby.

As mentioned above, the database-writing path had not been tested in production, so the switchover still carried some risk. We therefore did it in stages, focusing first on our internal dog food tenancy and routing first 10% of traffic, then 50%, then 100%.
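A staged switchover like this needs a routing decision that is stable from one request to the next. The sketch below shows one way to get that, and it rests on assumptions of its own: it buckets by a hypothetical tenant identifier, so it splits tenants and not individual requests, and the internal tenant name is made up.

```python
import hashlib

INTERNAL_TENANTS = {"tessian-internal"}  # hypothetical dog food tenant id


def bucket(tenant_id: str) -> int:
    """Map a tenant to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def use_python(tenant_id: str, rollout_percent: int,
               internal_first: bool = True) -> bool:
    """Decide whether this tenant's requests are served by Python or Ruby."""
    if internal_first and tenant_id in INTERNAL_TENANTS:
        return True  # the internal tenancy goes first, whatever the percentage
    return bucket(tenant_id) < rollout_percent
```

Hashing the tenant keeps each tenant on one side for the whole stage, which matters when only one of Ruby and Python is allowed to write to the database. It also means that raising the percentage from 10 to 50 only ever adds tenants to the Python side; nobody flips back.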

And that was it! Once we had done this for all APIs the amount of Ruby code in production was drastically reduced.

Retrospective

At Tessian we believe that we will build the best teams and product by being open about when things go wrong; we also believe in creating a blameless culture.

We suffered one incident as a result of this migration. This had a very limited impact on customers. It caused a minor degradation to our web UI only – not our predictions. We were able to retroactively fix the symptoms, but this was something that we took seriously.

The issue was that one of our new Python APIs did not update a database column that the Ruby APIs previously did.

In hindsight we think that our comparison framework gave us a false sense of security in the code porting. Our key takeaway was that next time we will have to be more conscious of what it would catch (us breaking our APIs) and what it would not (incorrect database updates). 
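One way to cover that blind spot is a test that looks at what a handler writes, not at what it returns. The sketch below is only an illustration of the idea and not something we ran: it uses an in-memory SQLite database from the Python standard library, and the handler, table and column names in the usage note are hypothetical.

```python
import sqlite3
from typing import Callable, Set


def columns_written(handler: Callable[[sqlite3.Connection], None],
                    setup_sql: str, table: str, row_id: int) -> Set[str]:
    """Run handler against a scratch database and return the columns it changed.

    `table` is interpolated into the SQL, so only pass trusted, hard-coded names.
    """
    conn = sqlite3.connect(":memory:")
    conn.row_factory = sqlite3.Row  # lets us read columns by name
    try:
        conn.executescript(setup_sql)
        query = f"SELECT * FROM {table} WHERE id = ?"
        before = conn.execute(query, (row_id,)).fetchone()
        handler(conn)
        after = conn.execute(query, (row_id,)).fetchone()
        return {col for col in before.keys() if before[col] != after[col]}
    finally:
        conn.close()
```

A test could then assert that the new handler touches the same columns as the old behaviour, for example `columns_written(new_handler, SETUP_SQL, "events", 1) == {"status", "reviewed_at"}`. The expected set would have to come from observing the Ruby code, for instance on a staging database.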

If we were to do it again, we would do it mostly the same way, but with more thorough code review on these components.

All in all we consider the project to be a success and believe that it will aid Tessian’s stellar security reputation, thanks to the amazing hard work of all the engineers who worked on this project!

Andy Smith, Head of Engineering