After 18 Months of Engineering OKRs, What Have We Learned?
By Andy Smith
03 June 2021
We have been using OKRs (Objectives and Key Results) at Tessian for over 18 months now, including in the Engineering team. They've grown into an essential part of the organizational fabric of the department, but it wasn't always this way. In this article I will share a few of the challenges we've faced, lessons we've learned, and some of the solutions that have worked for us. I won't try to sell you on OKRs or explain what an OKR is or how they work exactly; there's lots of great content that already does this!

Getting started

When we introduced OKRs, there were about 30 people in the Engineering department. The complexity of the team was just reaching the tipping point where planning becomes necessary to operate effectively. We had never really needed to plan before, so we found OKR setting quite challenging, and we found ourselves taking a long time to set what turned out to be bad OKRs. It was tempting to think that this pain was caused by OKRs themselves. On reflection today, however, it's clear that OKRs were merely surfacing an existing pain that would have emerged at some point anyway. If teams can't agree on an OKR, they're probably not aligned about what they are working on. OKRs surfaced this misalignment and caused a little pain during the setting process, preventing a much larger pain during the quarter, when the misalignment would have had a bigger impact.

The Key Result part of an OKR is supposed to describe the intended outcome in a specific and measurable way. This is sometimes straightforward, typically when a very clear metric is used, such as revenue or latency or uptime. However, in Engineering there are often KRs that are very hard to write well. It's too easy to end up with a bunch of KRs that drive us to ship a bunch of features on time, but have no aspect of quality or impact. The other pitfall is aiming for a very measurable outcome that is based on a guess, which is what happens when there is no baseline to work from. Again, these challenges exist without OKRs, but without OKRs they might never precipitate into a conversation about what a good outcome looks like for a particular deliverable. Unfortunately we haven't found the magic wand that makes this easy, and we still have some binary "deliver the feature" key results every quarter, but these are less frequent now. We will often set a KR to ship a feature and stand up a metric in Q1, and then set a target for that metric in Q2 once we have a baseline. Or, if we have a lot of delivery KRs, we'll pull them out of OKRs altogether and zoom out to set the KR around their overall impact.

An eternal debate in the OKR world is whether to set OKRs top-down (leadership dictates the OKRs and teams/individuals fill out the details), bottom-up (leadership aggregates the OKRs of teams and individuals into something coherent), or some mixture of the two. We use a blend: we draft department OKRs as a leadership team and then iterate a lot with teams, sometimes changing them entirely. This takes time, though. Every iteration uncovers misalignment, capacity, stakeholder or research issues that need to be addressed. We've sometimes been frustrated and rushed this through because it feels like a waste of time, but whenever we've done that, we've just ended up with bigger problems down the road that are harder to solve than setting decent OKRs in the first place.
The lesson we've learned is that effort, engagement with teams and old-fashioned rigor are required when setting OKRs, so we budget 3-4 weeks for the whole process.

Putting OKRs into Practice

The last three points have all been about setting OKRs, but what about actually using them day to day? We've learned two things: the importance of allowing a little flex, and how a frequent – but light – process is needed to get the most out of your OKRs.

First, flex. Our OKRs are quarterly, but sometimes we need to set a 6-month OKR because it just makes more sense! We encourage this to happen. We don't obsess about making OKRs ladder up perfectly to higher-level OKRs. It's nice when they do, but if this is a strict requirement, then we find it's hard to make OKRs that actually reflect the priorities of the quarter. Sometimes a month into the quarter, we realize we set a bad OKR or wrote it in the wrong way. A bit of flexibility here is important, but not too much. It's important to learn from planning failures, but it is probably more important that OKRs reflect teams' actual priorities and goals, or nobody is going to take them seriously. So tweak that metric or cancel that OKR if you really need to, but don't go wild.

Finally, process. If we don't actively check in on OKRs weekly, we tend to find that all the value we get from OKRs is diluted. Course-corrections come too late or worries go unsolved for too long. To keep this sustainable, we do this very quickly. I have an OKR check-in on the agenda for all my 1-1s with direct reports, and we run a 15-minute group meeting every week with the Product team where each OKR owner flags any OKRs that are off track, and we work out what we need to do to resolve them. Often this leads us to open a Slack channel or draft a document to solve the issue outside of the meeting so that we stick to the strict 15-minute time slot.

Many of these lessons have come from suggestions from the team, so my final tip is that if you're embarking on using OKRs in your Engineering team, or if you need to get them back on track, make sure you set some time aside to run a retrospective. This invites your leaders and managers to think about the mechanics of OKRs and planning, and they usually have the best ideas on how to improve things.
Tessian’s Client Side Integrations QA Journey – Part I
By Craig Callender
20 May 2021
In this series, we're going to go over the Quality Assurance journey we've been on here in the Client Side Integrations (CSI) team at Tessian. Most of this post will draw on our experience with the Outlook Add-in, as that's the piece of software most used by our clients. But the philosophies and learnings here apply to most software in general (regardless of where it's run), with an emphasis on software that includes a UI.

I'll admit that the impetus for this work was me sitting in my home office one Saturday morning, knowing that I'd have to start manual testing for an upcoming release in the next two weeks, and just not being able to convince myself to click the "Send" button in Outlook after typing a subject of "Hello world" one more time… But once you start automating UI tests, it just builds on itself and you start being able to construct new tests from existing code. It can be such a euphoric experience. If you've ever dreaded (or are currently dreading) running through manual QA tests, keep reading and see if you can implement some of the solutions we have.

Why QA in the Outlook Add-in Ecosystem is Hard

The Outlook Add-in was the first piece of software written to run on our clients' computers and, as a result of this – alongside needing to work in some of the oldest software Microsoft develops (Outlook) – there are challenges when it comes to QA. These challenges include:

- Detecting faults in the add-in itself
- Detecting changes in Outlook which may result in functionality loss of our add-in
- Detecting changes in Windows that may result in performance issues of our add-in
- Testing the myriad of environments our add-in will be installed in

The last point is the hardest to QA: even a list of just a subset of the different configurations of Outlook shows that the permutations of test environments just don't scale well:

- Outlook's Online vs Cached mode
- Outlook edition: 2010, 2013, 2016, 2019 perpetual license, 2019 volume license, M365 with its 5 update channels…
- Connected to on-premise Exchange vs Exchange Online/M365
- Other add-ins in Outlook
- Third-party Exchange add-ins (retention software, auditing, archiving, etc.)

Now add the non-Outlook-related environment issues we've had to work through:

- Network proxies, VPNs, endpoint protection
- Virus scanning software
- Windows versions

One can see how it would be impossible to predict all the environmental configurations and validate our add-in functionality before releasing it.

A Brief History of QA in Client Side Integrations (CSI)

As many companies do, we started our QA journey with a QA team. This was a set of individuals whose full-time job was to install the latest commit of our add-in and test its functionality. This quickly grew to the point where the team was managing and sharing VMs to ensure we worked across all the permutations above. They also worked hard to emulate the external factors of our clients' environments, like proxy servers, weak Internet connections, and so on. This model works well for exploratory testing and finding strange edge cases. Where it doesn't work or scale well is communication (the person seeing the bug isn't the person fixing the bug) and automation (every release takes more and more person-power as the list of regression issues gets longer and longer).

In 2020 Andy Smith, our Head of Engineering, made a commitment that all QA at Tessian would be performed by Developers.
This had a large impact on the CSI team, as we test an external application (Outlook) across many different versions and configurations which can affect its UI. So CSI set out a three-phase approach for the Development team to absorb the QA processes. (Watch how good we are at naming things in my team.)

Short-Term

The basic goal here was that the Developers would run through the same steps and processes that were already defined for our QA. This meant a lot of manual testing, manually configuring environments, etc. The biggest learning for our team during this phase was that whenever you have a QA department, there needs to be a Developer on an overview board to ensure that test steps actually test the thing you want. We found many instances where an assumption in a test step was incorrect or didn't fully test something.

Medium-Term

The idea here was that once the Developers were familiar and comfortable running through the processes defined by the QA department, we would take over ownership of the actual tests themselves and make edits. Often these edits resulted in the ability to test a piece of functionality with more accuracy or fewer steps, or to stand up environments that test more permutations. Owning the actual tests also meant that as we changed the steps, we needed to do so with automation in mind.

Long-Term

Automation. Whether it's unit, integration, or UI tests, we need them automated. Let a computer run the same test over and over again, and let the Developers think about the ever-increasing complexity of what and how we test.

Our QA Philosophy

Because it would be impossible for us to test every permutation of potential clients' environments (or even an existing client's environment) before we release our software, we approach our QA with the following philosophies:

Software Engineers are in the Best Position to Perform QA

This means that the people responsible for developing the feature or fixing the bug are the best people to write the test cases needed to validate the change, add those test cases to a release cycle, and even run the tests themselves. The why's of this could be (and probably will be) a whole post. 🙂

Bugs Will Happen

We're human. We'll miss something we shouldn't have. We won't think of something we should have. On top of that, we'll see something we didn't even think was a possibility. So be kind and focus on the solution rather than the bad code commit.

More Confidence, Quicker

Our QA processes are here to give us more confidence in our software as quickly as possible, so we can release features or fixes to our clients. Whether we're adding, editing, or removing a step in our QA process, we ask ourselves whether doing so will bring more confidence to our release cycle or speed it up. Sometimes we have to make trade-offs between the two.

Never Release the Same Bug Twice

Our QA process should be about preventing regressions on past issues just as much as it is about confirming the functionality of new features. We want a process robust enough that when an issue is encountered and solved once, that same issue is never found again. At the least, this means we'd never have the same bug with the same root cause again. At most, it means we never see the same type of bug again, as a root cause can be different even though the loss in functionality is the same.
An example of this last point: if our team works through an issue where a virus scanner is preventing us from saving an attachment to disk, we should have a test robust enough to also detect the same loss in functionality (the inability to save an attachment to disk) for any other cause – for example, a change to how Outlook allows access to the attachment, or another add-in stubbing the attachment to zero bytes for archiving purposes.

How Did We Do?

We started this journey with a handful of unit tests that were all automated in our CI environment.

Short-Term Phase

During the Short-Term phase, there was an emphasis on ensuring that new commits came with unit tests. Did we sometimes make a decision to release a feature with only manual tests because the code base didn't lend itself to unit testability? YES! But we worked hard to always ensure we had a good reason for excluding unit tests, instead of just assuming it couldn't be done because it hadn't been before. Being flexible while keeping your long-term goal in mind is key, and at times, challenging.

Medium-Term

This phase didn't involve nearly as much test re-writing as we had intentionally set out for. We added a section to our pull requests to include links to any manual testing steps required to test the new code. This resulted in more new manual tests being written by developers than edits to existing ones.

We did notice that the quality of tests changed. It's tempting to say "for the better" or "with better efficiency", but I believe most of the change can be attributed to an understanding that the tests were now being written for a different audience, namely Developers. They became a bit more abstract and a bit more technical. Less intuitive. They also became a bit more verbose, as we get a bad taste in our mouths whenever we see a manual step that says something like "Trigger an EnforcerFilter" with no description of which one. One that displays something to the user, or just the admin? Etc.

This phase was also much shorter than we had originally thought it would be.

Long-Term

This was my favorite phase. I'm actually one of those software engineers that LOVES writing unit tests. I attribute this to JetBrains' ReSharper (I could write about my love of ReSharper all day), whose interface gives me oh-so-satisfying green checkmarks as my tests run… I love seeing more and more green checkmarks!

We had arranged a long-term OKR with Andy, which gave us three quarters in 2021 to implement automation of three of our major modules (Constructor, Enforcer, and Guardian), with a stretch goal of getting one test working for our fourth major module, Defender. We blew this out of the water and met them all (including a beta module, Architect) in one quarter. It was addictive writing UI tests and watching the keyboard and mouse move on their own.

Wrapping it All Up

Like many software product companies large and small, Tessian started out with a manual QA department composed of technologists but not Software Engineers. Along the way, we made the decision that Software Engineers need to own the QA of the software they work on. This led us on a journey which included closer reviews of existing tests, writing new tests, and finally automating a majority of our tests. All of this combined allows us to release our software with more confidence, more quickly. Stay tuned for articles where we go into the details of the actual automation of UI tests and get our hands dirty with some fun code.
How Do You Encrypt PySpark Exceptions?
By Vladimir Mazlov
14 May 2021
We at Tessian are very passionate about the safety of our customers. We constantly handle sensitive email data to improve the quality of our protection against misdirected emails, exfiltration attempts, spear phishing, etc. This means that many of our applications handle data that we can't afford to have leaked or compromised. As part of our efforts to keep customer data safe, we take care to encrypt any exceptions we log, as you never know when a variable that has the wrong type happens to contain an email address. This approach allows us to be liberal with the access we give to the logs, while giving us comfort that customer data won't end up in them. Spark applications are no exception to this rule; however, implementing encryption for them turned out to be quite a journey. So let us be your guide on this journey. This is a tale of despair, of betrayal and of triumph. It is a tale of PySpark exception encryption.
Problem statement

Before we enter the gates of darkness, we need to share some details about our system so that you know where we're coming from. The language of choice for our backend applications is Python 3. To achieve exception encryption we hook into Python's error handling system and modify the traceback before logging it. This happens inside a function called init_logging() and looks roughly like this:
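(A minimal sketch of the idea; the encrypt() helper and key handling below are illustrative stand-ins, not our production code.)

    # Swap sys.excepthook for a wrapper that encrypts the formatted
    # traceback before it is logged. The Fernet key here is illustrative;
    # in production the key would come from configuration.
    import sys
    import traceback

    from cryptography.fernet import Fernet

    _fernet = Fernet(Fernet.generate_key())

    def encrypt(text):
        return _fernet.encrypt(text.encode("utf-8")).decode("ascii")

    def _encrypting_excepthook(exc_type, exc_value, exc_tb):
        plain = "".join(traceback.format_exception(exc_type, exc_value, exc_tb))
        sys.stderr.write("EncryptedException: %s\n" % encrypt(plain))

    def init_logging():
        # Hook into Python's error handling so any unhandled exception
        # is encrypted before it can reach the logs.
        sys.excepthook = _encrypting_excepthook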
We use Spark 2.4.4. Spark Jobs are written entirely in Python; consequently, we are concerned with Python exceptions here. If you’ve ever seen a complete set of logs from a YARN-managed PySpark cluster, you know that a single ValueError can get logged tens of times in different forms; our goal will be to make sure all of them are either not present or encrypted. We’ll be using the following error to simulate an exception raised by a Python function handling sensitive customer information:
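    # Hypothetical stand-in (function name and email invented for this post):
    # a function handling customer data that leaks an address into the error.
    def handle_record(record):
        raise ValueError("Unexpected value: %s" % record["sender"])

    # Raises, e.g.: ValueError: Unexpected value: jane.doe@example.com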
Looking at this, we can separate the problem into two parts: the driver and the executors.

The executors

Let's start with what we initially (correctly) perceived to be the main issue. Spark Executors are a fairly straightforward concept until you add Python into the mix. The specifics of what's going on inside are not often talked about and are relevant to the discussion at hand, so let's dive in.
All executors are actually JVMs, not Python interpreters, and are implemented in Scala. Upon receiving Python code that needs to be executed (e.g. in rdd.map), they start a daemon written in Python that is responsible for forking the worker processes and supplying them with means of talking to the JVM, via sockets. The protocol here is pretty convoluted and very low-level, so we won't go into too much depth. Two details will be relevant to us; both have to do with communication between the Python side and the executor JVM:

- The JVM executor expects the daemon to open a listening socket on the loopback interface and communicate the port back to it via stdout
- The worker code contains a general try-except that catches any errors from the application code and writes the traceback to the socket that's read by the JVM

Point 2 is how the Python exceptions actually get to the executor logs, which is exactly why we can't just use init_logging(), even if we could guarantee that it was called: Python tracebacks are actually logged by Scala code!

How is this information useful? Well, you might notice that the daemon controls all Python execution, as it spawns the workers. If we can make it spawn a worker that will encrypt exceptions, our problems are solved. And it turns out Spark has an option that does just that: spark.python.daemon.module. This solution actually works; the problem is that it's incredibly fragile:

- We now have to copy the code of the daemon, which makes Spark version updates difficult
- Remember, the daemon communicates the port to the JVM via stdout. Anything else written to stdout (say, a warning output by one of the packages used for encryption) destroys the executor.
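For concreteness, here's a rough sketch of the kind of custom daemon module this involves. The module name and the encrypt() helper are invented, and the pyspark internals referenced are from Spark 2.4 (the worker's top-level try/except formats errors with traceback.format_exc()), so verify against your version before trusting this:

    # encrypted_daemon.py -- enabled with:
    #   --conf spark.python.daemon.module=encrypted_daemon
    import traceback

    import pyspark.daemon
    import pyspark.worker

    from myapp.crypto import encrypt  # assumed helper

    class _EncryptingTraceback(object):
        # Proxy for the traceback module: pyspark.worker calls
        # traceback.format_exc() in its top-level except block before
        # writing the text back to the JVM, so swapping the module
        # reference encrypts what the executor ultimately logs.
        def __getattr__(self, name):
            return getattr(traceback, name)

        def format_exc(self):
            return "EncryptedException: " + encrypt(traceback.format_exc())

    pyspark.worker.traceback = _EncryptingTraceback()

    if __name__ == "__main__":
        # Delegate to the stock daemon. Nothing else may touch stdout here:
        # the JVM reads the daemon's port from it, and stray output (even a
        # package warning) kills the executor.
        pyspark.daemon.manager()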
As you can probably tell by the level of detail here, we really did think we could do the encryption at this level. Disappointed, we went one level up and took a look at how the PythonException was handled in the Scala code. It turns out it's just logged at ERROR level, with the Python traceback received from the worker treated as the message. Spark uses log4j, which provides a number of options to extend it; Spark additionally provides the option to override log processing using its configuration. Thus, we will have achieved our goal if we encrypt the messages of all exceptions at the log4j level. We did this by creating a custom RealEncryptExceptionLayout class that simply calls the default one unless it gets an exception, in which case it substitutes the exception with one whose message is encrypted. Here's how it broadly looks:
To make this work we shipped this as a jar to the cluster and, importantly, specified the following configuration:
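A plausible shape for that configuration (file names, paths and the package prefix are assumptions on our part; Spark 2.x ships log4j 1.2, hence the property syntax):

    # spark-defaults.conf (executor side; note this matters later)
    spark.executor.extraClassPath    /opt/tessian/encrypt-layout.jar
    spark.executor.extraJavaOptions  -Dlog4j.configuration=file:/etc/spark/log4j-encrypt.properties

    # log4j-encrypt.properties
    log4j.rootCategory=INFO, console
    log4j.appender.console=org.apache.log4j.ConsoleAppender
    log4j.appender.console.layout=com.tessian.spark.RealEncryptExceptionLayout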
And voilà!

The driver: executor errors by way of Py4J

Satisfied with ourselves, we decided to grep the logs for the error before moving on to errors in the driver. Said moving on was not yet to be, however, as we found the following in the driver's stdout:
Not only is this incredibly readable, it's also not encrypted! This exception, as you can very easily tell, is thrown by the Scala code, specifically DAGScheduler, when a task set fails, in this case due to repeated task failures. Fortunately for us, the driver simply runs Python code in the interpreter; as far as the interpreter is concerned, that code just happens to call py4j APIs that, in turn, communicate with the JVM. Thus, it's not meaningfully different from our backend applications in terms of error handling, so we can simply reuse the init_logging() function. If we do this and check the stdout, we see that it does indeed work:
The driver: executor errors by way of TaskSetManager

Yes, believe it or not, we haven't escaped the shadow of the Executor just yet. We've seen our fair share of the driver's stdout. But what about stderr? Wouldn't any reasonable person expect to see some of those juicy errors there as well? We pride ourselves on being occasionally reasonable, so we did check. And lo and behold:
Turns out there is yet another component that reports errors from the executors: TaskSetManager; our good friend DAGScheduler also logs this error when a stage crashes because of it. Both of them, however, do this while processing events originally emitted by the executors; so where does the traceback really come from? In a rare flash of logic in our dark journey: from the Executor class, specifically the run method:
Aha, there's a Serializer here! That's very promising; we should be able to extend or replace it to encrypt the exception before actual serialization, right? Wrong. In fact, to our dismay, that used to be possible but was removed in version 2.0.0 (reference: https://issues.apache.org/jira/browse/SPARK-12414). Seeing as nothing is configurable at this end, let's go back to the TaskSetManager and DAGScheduler and note that the offending tracebacks are logged by both of them. Since we are already manipulating the logging mechanism, why not go further down that lane and encrypt these logs as well? Sure, that's a possible solution. However, both log lines, as you can see in the snippet, are INFO. To find out that a particular log line contains a Python traceback from an executor, we'd have to modify the Layout to parse it. Instead of doing that and risking writing a bad regex (a distinct possibility, as some believe a good regex is an animal about as real as a unicorn), we decided to go for a simple and elegant solution: we simply don't ship the .jar containing the Layout to the driver. Like we said, elegant. That turns out to have the following effect:
And that's all that we find in the stderr! Which suits us just fine, as any errors from the driver will be wrapped in Py4J, diligently reported in the stdout and, as we've established, encrypted.

The driver: Python errors

That takes care of the executor errors in the driver. But the driver is nothing to sniff at either. It can fail and log exceptions just as well, can't it? As you have probably already guessed, this isn't really a problem. After all, the driver is just running Python code, and we're already calling init_logging(). Satisfyingly enough, it turns out to work as one would expect. For these errors we again need to look at the driver's stdout. If we raise the exception in the code executed in the driver (i.e. the main function), the stdout normally contains:
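(Illustrative only, using the hypothetical handle_record() from earlier:)

    Traceback (most recent call last):
      File "main.py", line 12, in main
        handle_record(record)
      File "main.py", line 7, in handle_record
        raise ValueError("Unexpected value: %s" % record["sender"])
    ValueError: Unexpected value: jane.doe@example.com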
Calling init_logging() turns this traceback into:
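(Again illustrative; the ciphertext here is invented:)

    EncryptedException: gAAAAABg9x... (an opaque blob that only we can decrypt)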
Conclusion

And thus our journey comes to an end. Ultimately our struggle has led us to two realizations; neither is particularly groundbreaking, but both are important to understand when dealing with PySpark:

- Spark is not afraid to repeat itself in the logs, especially when it comes to errors.
- PySpark specifically is written in such a way that the driver and the executors are very different.

Before we say our goodbyes, we feel we must address one question: WHY? Why go through with this and not just abandon this complicated project? The data our Spark jobs handle is very sensitive; in most cases it consists of various properties of emails sent or received by our customers. If we give up on encrypting the exceptions, we must accept that this very sensitive information could end up in a traceback, at which point it will be propagated by Spark to various log files. The only real way to guarantee no personal data is leaked in that case is to forbid access to the logs altogether. And while we did have to descend into the abyss and come back to achieve error encryption, debugging Spark jobs without access to logs is inviting the abyss inside yourself.
How We Improved Developer Experience in a Rapidly Growing Engineering Team
By Andy Smith
16 April 2021
Developer experience is one of the most important things for a Head of Engineering to care about. Is development safe and fast? Are developers proud of their work? Are our processes enabling great collaboration and getting the best out of the team? But sometimes, developer experience doesn't get the attention it deserves. It is never the most urgent problem to solve, there are lots of different opinions about how to make improvements, and it seems very hard to measure.

At Tessian the team grows and evolves very quickly; we've gone from 20 developers to over 60 in just 3 years. When the team was smaller, it was straightforward to keep a finger on the pulse of developer experience. With such a large and rapidly growing team, it's all too easy for developer experience to be overshadowed by other priorities. At the end of 2020, it became clear that we needed a department-wide view of the perception of our developer experience that we could use to inform decisions and to see whether those decisions had an impact.

We decided one thing that would really help was a regular survey. This would help us spot patterns quickly, and it would give us a way to know if we were improving or getting worse. Most importantly, it gives everyone in the team a chance to have their say and to understand what others are thinking. Borrowing some ideas from Spotify, we sent the survey out in January to the whole Engineering team to get their honest, anonymized feedback. We'll be repeating this quarterly. Here are some of the high-level topics we covered in the survey.

Speed and ease

To better understand if our developers feel they can work quickly and securely, we asked the following questions:

- How simple, safe and painless is it to release your work?
- Do you feel that the speed of development is high?
You can see we got a big spread of answers, with quite a few detractors. We looked into this more deeply and identified that the primary driver is that some changes cannot be released independently by developers; some changes have a dependency on other teams, and this can slow down development. We'd heard similar feedback before running the survey, which had led us to start migrating from Amazon ECS to Kubernetes; this will allow our Engineering teams to make more changes themselves. It was great to validate this strategy with results from the survey.

More feedback called out a lack of test automation in an important component of our system. We weren't taking risks here, but we were using up Engineering time unnecessarily. This led to us committing to a project to bring automation here, which has already led to us finding issues 15x quicker than before.
Autonomy and satisfaction

We identified two areas of strength by asking the following questions:

- How proud are you of the work you produce and the impact it has for customers?
- How much do you feel your team has a say in what they build and how they build it?
These are two areas that we've always worked very hard on because they are so important to us at Tessian. In fact, customer impact and having a say in what is built are the top two reasons that engineers decide to join Tessian. We've recently introduced a Slack channel called #securingthehumanlayer, where our Sales and Customer Success teams share quotes and stories from customers and prospects who have been wowed by their Tessian experience or who have avoided major data breaches (or embarrassing 'Oh sh*t' moments!). We've also introduced changes to how OKRs are set, which give the team much more autonomy over their OKRs and more time to collaborate with other teams when defining them. Recently we launched a new product feature, Misattached File Prevention. Within one hour of enabling this product for our customers, we were able to share an anonymised story of an awesome flag that we'd caught.
What's next?

We're running the next survey soon and are excited to see what we learn and how we can make the developer experience at Tessian as great as possible.
A Solution to HTTP 502 Errors with AWS ALB
By Samson Danziger
03 November 2020
At Tessian, we have many applications that interact with each other using REST APIs. We noticed in the logs that at random times, uncorrelated with traffic, and seemingly unrelated to any code we had actually written, we were getting a lot of HTTP 502 "Bad Gateway" errors. Now that the issue is fixed, I wanted to explain what this error means, how you get it, and how to solve it. My hope is that if you're having to solve this same issue, this article will explain why and what to do.

First, let's talk about load balancing
In a development system, you usually run one instance of a server and you communicate directly with it. You send HTTP requests to it, it returns responses, everything is golden. For a production system running at any non-trivial scale, this doesn't work. Why? Because the amount of traffic going to the server is much greater, and you need it to not fall over even if there are tens of thousands of users. Typically, servers have a maximum number of connections they can support. If a server goes over this number, new clients can't connect and have to wait until a connection is freed up.

In the old days, the solution might have been a bigger machine, with more resources and more available connections. Now we use a load balancer to manage connections from clients to multiple instances of the server. The load balancer sits in the middle and routes client requests to any available server in a pool that can handle them. If one server goes down, traffic is automatically routed to one of the others in the pool. If a new server is added, traffic is automatically routed to it, too. This all happens to reduce load on the others.
What are 502 errors?

On the web, there are a variety of HTTP status codes that are sent in response to requests to let the user know what happened. Some might be pretty familiar:

- 200 OK – Everything is fine.
- 301 Moved Permanently – I don't have what you're looking for, try here instead.
- 403 Forbidden – I understand what you're looking for, but you're not allowed here.
- 404 Not Found – I can't find whatever you're looking for.
- 503 Service Unavailable – I can't handle the request right now, probably too busy.

4xx and 5xx codes both deal with errors. 4xx codes are client errors, where the user has done something wrong. 5xx codes, on the other hand, are server errors, where something is wrong on the server and it's not your fault. All of these are specified by a standard called RFC 7231. For 502 it says:

    The 502 (Bad Gateway) status code indicates that the server, while acting as a gateway or proxy, received an invalid response from an inbound server it accessed while attempting to fulfill the request.

The load balancer sits in the middle, between the client and the actual service you want to talk to. Usually it acts as a dutiful messenger passing requests and responses back and forth. But if the service returns an invalid or malformed response, instead of passing that nonsensical information on to the client, it sends back a 502 error instead. This lets the client know that the response the load balancer received was invalid.
The actual issue

Adam Crowder has done a full analysis of this problem, tracking it all the way down to TCP packet capture to assess what's going wrong. That's a bit out of scope for this post, but here's a brief summary of what's happening:

- At Tessian, we have lots of interconnected services. Some of them have Application Load Balancers (ALBs) managing the connections to them.
- In order to make an HTTP request, we must open a TCP socket from the client to the server. Opening a socket involves performing a three-way handshake with the server before either side can send any data.
- Once we've finished sending data, the socket is closed with a four-step process. These three- and four-step processes can be a large overhead when not much actual data is sent.
- Instead of opening and then closing one socket per HTTP request, we can keep a socket open for longer and reuse it for multiple HTTP requests. This is called HTTP Keep-Alive. Either the client or the server can then initiate a close of the socket with a FIN segment (either for fun or due to timeout). A small client-side sketch of keep-alive follows this list.
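To make keep-alive concrete, here's a minimal client-side illustration using Python's requests library (the URL is a placeholder):

    # One TCP connection is opened, then reused for all three requests,
    # avoiding a fresh handshake per request.
    import requests

    session = requests.Session()
    for _ in range(3):
        session.get("https://example.com/health")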
The 502 Bad Gateway error is caused when the ALB sends a request to a service at the same time that the service closes the connection by sending a FIN segment to the ALB socket. The ALB socket receives the FIN, acknowledges it, and starts a new handshake procedure. Meanwhile, the socket on the service side has just received a data request referencing the previous (now closed) connection. Because it can't handle that request, it sends an RST segment back to the ALB, and the ALB then returns a 502 to the user. Roughly, the exchange between the ALB and server sockets looks like this:
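    ALB socket                            Service socket
    ----------                            --------------
    sends HTTP request on the
    open keep-alive connection
                                          sends FIN (keep-alive timeout reached)
    receives FIN, ACKs it, starts
    a fresh handshake
                                          receives a request on a connection it
                                          has already closed; replies with RST
    receives RST; gives up and
    returns 502 to the client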
How to fix 502 errors

It's fairly simple. Just make sure that the service doesn't send a FIN segment before the ALB sends one to the service. In other words, make sure the service doesn't close the HTTP Keep-Alive connection before the ALB does. The default idle timeout for the AWS Application Load Balancer is 60 seconds, so we changed the service timeouts to 65 seconds. Barring two hiccoughs shortly after deploying, this has totally fixed it.

The actual configuration change

I have included the configuration for common Python and Node server frameworks below. If you are using any of those, you can just copy and paste. If not, these should at least point you in the right direction.

uWSGI (Python)

As a config file:

    # app.ini
    [uwsgi]
    ...
    harakiri = 65
    add-header = Connection: Keep-Alive
    http-keepalive = 1
    ...

Or as command line arguments:

    --add-header "Connection: Keep-Alive" --http-keepalive --harakiri 65

Gunicorn (Python)

As command line arguments:

    --keep-alive 65

Express (Node)

In Express, specify the time in milliseconds on the server object:

    const express = require('express');
    const app = express();
    const server = app.listen(80);
    server.keepAliveTimeout = 65000;