

Engineering Team
Tessian’s CSI QA Journey: WinAppDriver, Office Apps, and Sessions
By Tessian
30 June 2021
Introduction
In part one, we went over the decisions that led the CSI team to start automating its UI application, with a focus on the process drivers and the journey. Today we're going to start going over the technical challenges, solutions, and learnings along the way. A basic understanding of how to use WinAppDriver for UI testing will help; since there are already a multitude of beginner tutorials, this post will be more in-depth. All code samples are available as a complete solution here.

How We Got Here
As I'm sure many others have done before, we started by adapting WinAppDriver samples into our own code base. After we had about 20 tests up and running, it became clear that taking some time to better architect common operations would help in fixing tests as we targeted more versions of Outlook, Windows, etc. Simple things like how long to wait for a window to open, or how long to wait to receive an email, can be affected by the test environment, and it quickly becomes tedious to change these in 20 different places whenever we arrive at a new understanding of (or solution for) the best way to do these operations.

Application Sessions
A good place to start when writing UI tests is just getting the tests to open the application. There are plenty of samples online that show you how to do this, but there are a few things the samples leave each of us to solve on our own that I think are worth sharing with the wider Internet community.

All Application Sessions are Pretty Similar
And when code keeps repeating itself, it's time to abstract it into interfaces and classes. So we have both: an interface and a base class:
Don't worry, we'll get into the bits. The main point of this class is that it handles starting/stopping or attaching/detaching to applications, and that it stores enough information about the application under test to do those operations. In the constructor, the name of the process is used to determine whether we can attach to an already running process, whereas the path to the executable is used if we don't find a running process and need to start a fresh instance. The process name can be found in the Task Manager's Details tab.

Your Tests Should Run WinAppDriver
I can't tell you how many times I've clicked run on my tests only to have them all fail because I forgot to start the WinAppDriver process beforehand. WinAppDriver is the application that drives the mouse and keyboard clicks, along with getting element IDs, names, classes, etc. of the application under test. Using the same approach WinAppDriver's examples show for starting any application, you can start the WinAppDriver process as well. Using IManageSession and BaseSession<T> above, we get:
The default constructor just calls BaseSession<WinAppDriverProcess> with the name of the process and the path to the executable. So you can see that StartSession here is implemented to be thread safe.  This ensures that only one instance can be created in a test session, and that it’s created safely in an environment where you run your tests across multiple threads.  It then queries the base class about whether the application you’re starting is already running or not.  If it is running, we attach to it.  If it’s not, we start a new instance and attach to that.  Here are those methods:
These are both named Unsafe to show that they’re not thread safe, and it’s up to the calling method to ensure thread safety.  In this case, that’s StartSession(). And for completeness, StopSession does something very similar except it queries BaseSession<T> to see if we own the process (i.e. it was started as a fresh instance and not attached to), or not.  If we own it, then we’re responsible for shutting it down, but if we only attach to it, then leave it open.
You'll Probably Want a DesktopSession
Desktop sessions are a useful way to test elements from the root of the Windows Desktop. This includes things like the Start Menu, sys-tray, or File Explorer windows. We use it for our sys-tray icon functionality, but regardless of what you need it for, WinAppDriver's FAQ provides the details; here I've made it work using IManageSession and BaseSession<T>:
It's a lot simpler, since we'd never be required to start the root session. It's still helpful to have it inherit from BaseSession<T>, as that provides some base functionality like storing the instance in a Singleton and knowing how long to wait for windows to appear when switching to/from them.

Sessions for Applications with Splash Screens
This includes all the Office applications. WinAppDriver's FAQ has some help on this, but I think I've improved it a bit with a do/while loop that waits for the main window to appear. The other methods look similar to the above, so I've collapsed them for brevity.
Putting it All Together
So how do we put all this together and make a test run? Glad you asked!

NUnit
I make fairly heavy use of NUnit's class- and method-level attributes to ensure things get set up correctly depending on the assembly, namespace, or class a test is run in. Mainly, I have a OneTimeSetUp for the whole assembly that starts WinAppDriver and attaches to the Desktop root session.
Then I separate my tests into namespaces that correspond to the application under test – in this case, it’s Outlook.  
I then use a OneTimeSetUp in that namespace that starts Outlook (or attaches to it).
Finally, I use SetUp and TearDown attributes on the test classes to ensure I start and end each test from the main application window.
The Test
All of that allows you to write the (somewhat verbose) test:
Wrapping It All Up
In this post we went into the details of how to organize and code your Sessions for UI testing. We showed you how to design them so you can reuse code between different application sessions. We also enabled them to either start the application or connect to an already running application instance (and showed how the Session object can determine which to do itself). Finally, we put it all together and created a basic test that drives Outlook's UI to compose a new email message and send it. Stay tuned for the next post, where we'll delve into how to handle all the dialog windows your UI needs to interact with, and how to abstract that away, so you can write a full test with something that looks like this:
Tessian Culture Engineering Team
React Hooks at Tessian
By Luke Barnard
16 June 2021
I'd like to describe Tessian's journey with React hooks so far, covering some technical aspects as we go. About two years ago, some of the Frontend guild at Tessian were getting very excited about a new React feature that was being made available in an upcoming version: React Hooks.

React Hooks are a very powerful way to encapsulate state within a React app. In the words of the original blog post, they make it possible to share stateful logic between multiple components. Much like React components, they can be composed to create more powerful hooks that combine multiple different stateful aspects of an application together in one place.

So why were we so excited about the possibilities that these hooks could bring? The answer could be found in the way we were writing features before hooks came along. Every time we wrote a feature, we would have to write extra "boilerplate" code using what was, at some point, considered by the React community to be the de facto method for managing state within a React app: Redux. As well as Redux, we depended on Redux Sagas, a popular library for implementing asynchronous functionality within the confines of Redux. Combined, these two(!) libraries gave us the foundation upon which to do… very simple things: mostly API requests, handling responses, and tracking loading and error states for each API that our app interacted with. The overhead of working this way meant that each feature required a new set of sagas, reducers, actions and of course the UI itself, not to mention the tests for each of these. This would often come up as a talking point when deciding how long a certain task would take during a sprint planning session.

Of course there were some benefits in being able to isolate each aspect of every feature. Redux and Redux Sagas are both well known for being easy to test, making testing of state changes and asynchronous API interactions very straightforward and very (if not entirely) predictable. But there are other ways to keep testing important parts of code, even when hooks get involved (more on that another time). Also, I think it's important to note that there are ways of using Redux Sagas without maintaining a lot of boilerplate, e.g. by using a generic saga, reducer and actions to handle all API requests. This would still require certain components to be connected to the Redux store, which is not impossible but might encourage prop-drilling.

In the end, everyone agreed that the pattern we were using didn't suit our needs, so we decided to introduce hooks to the app, specifically for new feature development. We also agreed that changing everything all at once, in a field where paradigms fall into and out of fashion rather quickly, was a bad idea. So we settled on a compromise where we would gradually introduce small pieces of functionality to test the waters. I'd like to introduce some examples of hooks that we use at Tessian to illustrate our journey with them.

Tessian's first hook: usePortal
Our first hook was usePortal. The idea behind the hook was to take any component and insert it into a React Portal. This is particularly useful where the UI is shown "above" everything else on the page, such as dialog boxes and modals. The documentation for React Portals recommends using a React Class Component, using the lifecycle methods to instantiate and tear down the portal as the component mounts/unmounts.
Knowing we could achieve the same thing with hooks, we wrote a hook that would handle this functionality and encapsulate it, ready to be reused by our myriad of modals, dialog boxes and popouts across the Tessian portal. The gist of the hook is something like this:
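A minimal sketch of the idea, assuming React 16.8+ and react-dom's createPortal (the code below is illustrative rather than the exact production hook):

```tsx
import { useCallback, useEffect, useRef, ReactNode } from "react";
import { createPortal } from "react-dom";

// Sketch only: create a detached <div>, keep it attached to document.body
// for as long as the calling component is mounted, and return a component
// that renders its children into that node via a portal.
export function usePortal() {
  const elRef = useRef<HTMLDivElement | null>(null);
  if (elRef.current === null) {
    elRef.current = document.createElement("div");
  }

  useEffect(() => {
    const el = elRef.current!;
    document.body.appendChild(el);
    // Tear the node down on unmount, mirroring the class-component
    // lifecycle approach mentioned above.
    return () => {
      document.body.removeChild(el);
    };
  }, []);

  // The returned function can be treated as a React component.
  return useCallback(
    ({ children }: { children: ReactNode }) => createPortal(children, elRef.current!),
    []
  );
}
```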
Note that the hook returns a function that can be treated as a React component. This pattern is reminiscent of React HOCs, which are typically used to share concerns across multiple components. Hooks enable something similar but instead of creating a new class of component, usePortal can be used by any (function) component. This added flexibility gives hooks an advantage over HOCs in these sorts of situations. Anyway, the hook itself is very simple in nature, but what it enables is awesome! Here’s an example of how usePortal can be used to give a modal component its own portal:
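An illustrative, hypothetical Modal built on the sketch above:

```tsx
import { ReactNode } from "react";
import { usePortal } from "./usePortal"; // the sketch above; the path is hypothetical

// Illustrative Modal: a single extra line gives it its own portal.
const Modal = ({ children }: { children: ReactNode }) => {
  const Portal = usePortal();
  return (
    <Portal>
      <div className="modal-overlay">
        <div className="modal-content">{children}</div>
      </div>
    </Portal>
  );
};

export default Modal;
```

The single `const Portal = usePortal();` line is the hook doing all of the portal bookkeeping.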
Just look at how clean that is! One line of code for an infinite amount of behind-the-scenes complexity, including side-effects and asynchronous behaviors! It would be an understatement to say that at this point, the entire team was hooked on hooks!

Tessian's hooks, two months later
Two months later we wrote hooks for interacting with our APIs. We were already using Axios as our HTTP request library and we had a good idea of our requirements for pretty much any API interaction. We wanted:
- To be able to specify anything accepted by the Axios library
- To be able to access the latest data returned from the API
- To have an indication of whether an error had occurred and whether a request was ongoing
Our real useFetch hook has since become a bit more complicated, but to begin with it looked something like this:
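A rough, illustrative sketch of such a hook built on Axios and useState/useCallback; the flag names and generic parameter here are assumptions rather than the real API:

```tsx
import { useCallback, useState } from "react";
import axios, { AxiosRequestConfig } from "axios";

// Sketch only: exposes the latest data, a loading flag, an error flag,
// and a function that performs a request with any Axios config.
export function useFetch<T = unknown>() {
  const [data, setData] = useState<T | null>(null);
  const [isLoading, setIsLoading] = useState(false);
  const [hasError, setHasError] = useState(false);

  const fetch = useCallback(async (config: AxiosRequestConfig) => {
    setIsLoading(true);
    setHasError(false);
    try {
      const response = await axios.request<T>(config);
      setData(response.data);
      return response.data;
    } catch {
      setHasError(true);
      return null;
    } finally {
      setIsLoading(false);
    }
  }, []);

  return { data, isLoading, hasError, fetch };
}
```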
Compare this to the amount of code we would have to write for Redux sagas, reducers and actions: there's no comparison. This hook clearly encapsulated a key piece of functionality that we have since gone on to use dozens of times in dozens of new features. From here on out, hooks were here to stay in the Tessian portal, and we decided to phase out Redux for use in features. Today there are 72 places where we've used this hook or its derivatives; that's 72 times we haven't had to write any sagas, reducers or actions to manage API requests!

Tessian's hooks in 2021
I'd like to conclude with one of our more recent additions to our growing family of hooks. Created by our resident "hook hacker", João, this hook encapsulates a very common UX paradigm seen in basically every app. It's called useSave. The experience is as follows:
- The user is presented with a form or a set of controls that can be used to alter the state of some object or document in the system.
- When a change is made, the object is considered "edited" and must be "saved" by the user in order for the changes to persist and take effect. Changes can also be "discarded" such that the form returns to the initial state.
- The user should be prompted when navigating away from the page or closing the page to prevent them from losing any unsaved changes.
- When the changes are in the process of being saved, the controls should be disabled and there should be some indication to let the user know that: (a) the changes are being saved, (b) the changes have been saved successfully, or (c) there was an error with their submission.
Each of these aspects requires the use of a few different native hooks:
- A hook to track the object data with the user's changes (useState)
- A hook to save the object data on the server and expose the current object data (useFetch)
- A hook to update the tracked object data when a save is successful (useEffect)
- A hook to prevent the window from closing/navigating if changes haven't been saved yet (useEffect)
Here's a simplified version:
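The following is an illustrative sketch only; it assumes the useFetch sketch above and a made-up saveUrl parameter, so the real hook will differ in its details:

```tsx
import { useCallback, useEffect, useState } from "react";
import { useFetch } from "./useFetch"; // the sketch above; the path is hypothetical

// Sketch only. `saveUrl` is a made-up parameter; the server is assumed to
// echo back the saved object in its response.
export function useSave<T>(initial: T, saveUrl: string) {
  // 1. Track the object data with the user's changes (useState).
  const [draft, setDraft] = useState<T>(initial);

  // 2. Save the object data on the server and expose the current object data (useFetch).
  const { data, isLoading, hasError, fetch } = useFetch<T>();
  const saved = data ?? initial;

  // Naive dirty check; good enough for a sketch.
  const hasChanges = JSON.stringify(draft) !== JSON.stringify(saved);

  const save = useCallback(
    () => fetch({ url: saveUrl, method: "PUT", data: draft }),
    [fetch, saveUrl, draft]
  );
  const discard = useCallback(() => setDraft(saved), [saved]);

  // 3. Update the tracked object data when a save is successful (useEffect).
  useEffect(() => {
    if (data !== null) {
      setDraft(data);
    }
  }, [data]);

  // 4. Prevent the window from closing if changes haven't been saved yet (useEffect).
  useEffect(() => {
    if (!hasChanges) return undefined;
    const warn = (e: BeforeUnloadEvent) => {
      e.preventDefault();
      e.returnValue = "";
    };
    window.addEventListener("beforeunload", warn);
    return () => window.removeEventListener("beforeunload", warn);
  }, [hasChanges]);

  return { draft, setDraft, save, discard, hasChanges, isLoading, hasError };
}
```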
As you can see, the code is fairly concise and, more importantly, it makes no mention of any UI component. This separation means we can use this hook in any part of our app with any of our existing UI components (whether old or new). An exercise for the reader: see if you can change the hook above so that it exposes a textual label to indicate the current state of the saved object. For example, if isLoading is true, maybe the label could indicate "Saving changes…", or if hasChanges is true, the text could read "Click 'Save' to save changes".

Tessian is hiring!
Thanks for following me on this wild hook-based journey, I hope you found it enlightening or inspiring in some way. If you're interested in working with other engineers that are super motivated to write code that can empower others to implement awesome features, you're in luck! Tessian is hiring for a range of different roles, so connect with me on LinkedIn, and I can refer you!
Engineering Team
After 18 Months of Engineering OKRs, What Have We Learned?
By Andy Smith
03 June 2021
We have been using OKRs (Objectives and Key Results) at Tessian for over 18 months now, including in the Engineering team. They’ve grown into an essential part of the organizational fabric of the department, but it wasn’t always this way. In this article I will share a few of the challenges we’ve faced, lessons we’ve learned and some of the solutions that have worked for us. I won’t try and sell you on OKRs or explain what an OKR is or how they work exactly; there’s lots of great content that already does this! Getting started When we introduced OKRs, there were about 30 people in the Engineering department. The complexity of the team was just reaching the tipping point where planning becomes necessary to operate effectively. We had never really needed to plan before, so we found OKR setting quite challenging, and we found ourselves taking a long time to set what turned out to be bad OKRs. It was tempting to think that this pain was caused by OKRs themselves. On reflection today, however, it’s clear that OKRs were merely surfacing an existing pain that would emerge at some point anyway. If teams can’t agree on an OKR, they’re probably not aligned about what they are working on. OKRs surfaced this misalignment and caused a little pain during the setting process that prevented a large pain during the quarter when the misalignment would have had a larger impact. The Key Result part of an OKR is supposed to describe the intended outcome in a specific and measurable way. This is sometimes straightforward, typically when a very clear metric is used, such as revenue or latency or uptime. However, in Engineering there are often KRs that are very hard to write well. It’s too easy to end up with a bunch of KRs that drive us to ship a bunch of features on time, but have no aspect of quality or impact. The other pitfall is aiming for a very measurable outcome that is based on a guess, which is what happens when there is no baseline to work from. Again, these challenges exist without OKRs, but they may never precipitate into the conversation about what a good outcome is for a particular deliverable without OKRs there to make it happen. Unfortunately we haven’t found the magic wand that makes this easy, and we still have some binary “deliver the feature” key results every quarter, but these are less frequent now. We will often set a KR to ship a feature in Q1 and to set up a metric and will then set a target for the metric in Q2 once we have a baseline. Or if we have a lot of delivery KRs, we’ll pull them out of OKRs altogether and zoom out to set the KR around their overall impact. An eternal debate in the OKR world is whether to set OKRs top-down (leadership dictate the OKRs and teams/individuals fill out the details) or bottom-up (leadership aggregates the OKRs of teams and individuals into something coherent) or some mixture of the two. We use a blend of the two, and will draft department OKRs as a leadership team and then iterate a lot with teams, sometimes changing them entirely. This takes time, though. Every iteration uncovers misalignment, capacity, stakeholder or research issues that need to be addressed. We’ve sometimes been frustrated and rushed this through as it feels like a waste of time, but when we’ve done this, we’ve just ended up with bigger problems later down the road that are harder to solve than setting decent OKRs in the first place. 
The lesson we've learned is that effort, engagement with teams and old-fashioned rigor are required when setting OKRs, so we budget 3-4 weeks for the whole process.

Putting OKRs into Practice
The last three points have all been about setting OKRs, but what about actually using them day to day? We've learned two things: the importance of allowing a little flex, and how a frequent, but light, process is needed to get the most out of your OKRs.

First, flex. Our OKRs are quarterly, but sometimes we need to set a 6-month OKR because it just makes more sense! We encourage this to happen. We don't obsess about making OKRs ladder up perfectly to higher-level OKRs. It's nice when they do, but if this is a strict requirement, then we find that it's hard to make OKRs that actually reflect the priorities of the quarter. Sometimes a month into the quarter, we realize we set a bad OKR or wrote it in the wrong way. A bit of flexibility here is important, but not too much. It's important to learn from planning failures, but it is probably more important that OKRs reflect teams' actual priorities and goals, or nobody is going to take them seriously. So tweak that metric or cancel that OKR if you really need to, but don't go wild.

Finally, process. If we don't actively check in on OKRs weekly, we tend to find that all the value we get from OKRs is diluted. Course corrections come too late or worries go unsolved for too long. To keep this sustainable, we do this very quickly. I have an OKR check-in on the agenda for all my 1-1s with direct reports, and we run a 15-minute group meeting every week with the Product team where each OKR owner flags any OKRs that are off track, and we work out what we need to do to resolve them. Often this causes us to open a Slack channel or draft a document to solve the issue outside of the meeting so that we stick to the strict 15-minute time slot.

Many of these lessons have come from suggestions from the team, so my final tip is that if you're embarking on using OKRs in your Engineering team, or if you need to get them back on track, make sure you set some time aside to run a retrospective. This invites your leaders and managers to think about the mechanics of OKRs and planning, and they usually have the best ideas on how to improve things.
Engineering Team
Tessian’s Client Side Integrations QA Journey – Part I
By Craig Callender
20 May 2021
In this series, we're going to go over the Quality Assurance journey we've been on here in the Client Side Integrations (CSI) team at Tessian. Most of this post will draw on our experience with the Outlook Add-in, as that's the piece of software most used by our clients. But the philosophies and learnings here apply to most software in general (regardless of where it's run), with an emphasis on software that includes a UI. I'll admit that the impetus for this work was me sitting in my home office one Saturday morning, knowing that I'd have to start manual testing for an upcoming release in the next two weeks and just not being able to convince myself to click the "Send" button in Outlook after typing a subject of "Hello world" one more time… But once you start automating UI tests, it just builds on itself and you start being able to construct new tests from existing code. It can be such a euphoric experience. If you've ever dreaded (or are currently dreading) running through manual QA tests, keep reading and see if you can implement some of the solutions we have.

Why QA in the Outlook Add-in Ecosystem is Hard
The Outlook Add-in was the first piece of software written to run on our clients' computers and, as a result of this, alongside needing to work in some of the oldest software Microsoft develops (Outlook), there are challenges when it comes to QA. These challenges include:
- Detecting faults in the add-in itself
- Detecting changes in Outlook which may result in functionality loss of our add-in
- Detecting changes in Windows that may result in performance issues of our add-in
- Testing the myriad of environments our add-in will be installed in
The last point is the hardest to QA; even a list of just a subset of the different configurations of Outlook shows that the permutations of test environments just don't scale well:
- Outlook's Online vs Cached mode
- Outlook edition: 2010, 2013, 2016, 2019 perpetual license, 2019 volume license, M365 with its 5 update channels…
- Connected to On-Premise Exchange vs Exchange Online/M365
- Other add-ins in Outlook
- Third-party Exchange add-ins (retention software, auditing, archiving, etc…)
And now add the non-Outlook related environment issues we've had to work through:
- Network proxies, VPNs, endpoint protection
- Virus scanning software
- Windows versions
One can see how it would be impossible to predict all the environmental configurations and validate our add-in functionality before releasing it.

A Brief History of QA in Client Side Integrations (CSI)
As many companies do, we started our QA journey with a QA team. This was a set of individuals whose full-time job was to install the latest commit of our add-in and test its functionality. This quickly grew to the point where this team was managing and sharing VMs to ensure we worked on all those permutations above. They also worked hard to try and emulate the external factors of our clients' environments, like proxy servers, weak Internet connections, etc. This model works well for exploratory testing and finding strange edge cases, but where it doesn't work well or scale well is around communication (the person seeing the bug isn't the person fixing the bug) and automation (every release takes more and more person-power as the list of regression issues gets longer and longer). In 2020 Andy Smith, our Head of Engineering, made a commitment that all QA in Tessian would be performed by Developers.
This had a large impact on the CSI team, as we test an external application (Outlook) across many different versions and configurations which can affect its UI. So CSI set out a three-phase approach for the Development team to absorb the QA processes. (Watch how good we are at naming things in my team.)

Short-Term
The basic goal here was that the Developers would run through the same steps and processes that were already defined for our QA. This meant a lot of manual testing, manually configuring environments, etc. The biggest learning for our team during this phase was that there needs to be a Developer on an overview board whenever you have a QA department, to ensure that test steps actually test the thing you want. We found many instances where an assumption made in a test step was incorrect or didn't fully test something.

Medium-Term
The idea here was that once the Developers were familiar and comfortable running through the processes defined by the QA department, we would then take over ownership of the actual tests themselves and make edits. Often these edits resulted in the ability to test a piece of functionality with more accuracy or fewer steps. It also included the ability to stand up environments that test more permutations, etc. The ownership of the actual tests also meant that as we changed the steps, we needed to do it with automation in mind.

Long-Term
Automation. Whether it's unit, integration, or UI tests, we need them automated. Let a computer run the same test over and over again, and let the Developers think about the ever-increasing complexity of what and how we test.

Our QA Philosophy
Because it would be impossible for us to test every permutation of potential clients' environments (or even an existing client's environment) before we release our software, we approach our QA with the following philosophies:

Software Engineers are in the Best Position to Perform QA
This means that the people responsible for developing the feature or bug fix are the best people to write the test cases needed to validate the change, add those test cases to a release cycle, and even run the tests themselves. The why's of this could be (and probably will be) a whole post. 🙂

Bugs Will Happen
We're human. We'll miss something we shouldn't have. We won't think of something we should have. On top of that, we'll see something we didn't even think was a possibility. So be kind and focus on the solution rather than the bad code commit.

More Confidence, Quicker
Our QA processes are here to give us more confidence in our software as quickly as possible, so we can release features or fixes to our clients. Whenever we're adding, editing, or removing a step in our QA process, we ask ourselves whether doing so will bring more confidence to our release cycle or speed it up. Sometimes we have to make trade-offs between the two.

Never Release the Same Bug Twice
Our QA process should be about preventing regressions on past issues just as much as it is about confirming the functionality of new features. We want a robust enough process that when an issue is encountered and solved once, that same issue is never found again. At the least, this would mean we'd never have the same bug with the same root cause again. At most, it would mean that we never see the same type of bug again, as a root cause could be different even though the loss in functionality is the same.
An example of this last point: if our team works through an issue where a virus scanner is preventing us from saving an attachment to disk, we should have a robust enough test that it will also detect the same loss of functionality (the inability to save an attachment to disk) for any cause (for example, a change to how Outlook allows access to the attachment, or another add-in stubbing the attachment to zero bytes for archiving purposes, etc.).

How Did We Do?
We started this journey with a handful of unit tests that were all automated in our CI environment.

Short-Term Phase
During the Short-Term phase, there was an emphasis on ensuring that new commits had unit tests alongside them. Did we sometimes make a decision to release a feature with only manual tests because the code base didn't lend itself to unit testability? YES! But we worked hard to always ensure we had a good reason for excluding unit tests, instead of just assuming it couldn't be done because it hadn't been done before. Being flexible while at the same time keeping your long-term goal in mind is key, and at times challenging.

Medium-Term
This phase wasn't made up of nearly as much test re-writing as we had intentionally set out for. We added a section to our pull requests to include links to any manual testing steps required to test the new code. This resulted in more new manual tests being written by developers than edits to existing ones. We did notice that the quality of tests changed. It's tempting to say "for the better" or "with better efficiency", but I believe most of the change can be attributed to an understanding that the tests were now being written for a different audience, namely Developers. They became a bit more abstract and a bit more technical. Less intuitive. They also became a bit more verbose, as we get a bad taste in our mouths whenever we see a manual step that says something like "Trigger an EnforcerFilter" with no description of which one. One that displays something to the user, or just the admin? Etc. This phase was also much shorter than we had originally thought it would be.

Long-Term
This was my favorite phase. I'm actually one of those software engineers that LOVES writing unit tests. I attribute this to JetBrains' ReSharper (I could write about my love of ReSharper all day), whose interface gives me oh-so-satisfying green checkmarks as my tests run… I love seeing more and more green checkmarks! We had arranged a long-term OKR with Andy, which gave us three quarters in 2021 to implement automation for three of our major modules (Constructor, Enforcer, and Guardian), with a stretch goal of getting one test working for our fourth major module, Defender. We blew this out of the water and met them all (including a beta module, Architect) in one quarter. It was addicting writing UI tests and watching the keyboard and mouse move on their own.

Wrapping it All Up
Like many software product companies large and small, Tessian started out with a manual QA department composed of technologists but not Software Engineers. Along the way, we made the decision that Software Engineers need to own the QA of the software they work on. This led us on a journey which included closer reviews of existing tests, writing new tests, and finally automating a majority of our tests. All of this combined allows us to release our software with more confidence, more quickly. Stay tuned for articles where we go into details about the actual automation of UI tests and get our hands dirty with some fun code.
Engineering Team
How Do You Encrypt PySpark Exceptions?
By Vladimir Mazlov
14 May 2021
We at Tessian are very passionate about the safety of our customers. We constantly handle sensitive email data to improve the quality of our protection against misdirected emails, exfiltration attempts, spear phishing etc. This means that many of our applications handle data that we can’t afford to have leaked or compromised. As part of our efforts to keep customer data safe, we take care to encrypt any exceptions we log, as you never know when a variable that has the wrong type happens to contain an email address. This approach allows us to be liberal with the access we give to the logs, while giving us comfort that customer data won’t end up in them. Spark applications are no exception to this rule, however, implementing encryption for them turned out to be quite a journey. So let us be your guide on this journey. This is a tale of despair, of betrayal and of triumph. It is a tale of PySpark exception encryption.
Problem statement
Before we enter the gates of darkness, we need to share some details about our system so that you know where we're coming from. The language of choice for our backend applications is Python 3. To achieve exception encryption, we hook into Python's error handling system and modify the traceback before logging it. This happens inside a function called init_logging() and looks roughly like this:
We use Spark 2.4.4. Spark Jobs are written entirely in Python; consequently, we are concerned with Python exceptions here. If you’ve ever seen a complete set of logs from a YARN-managed PySpark cluster, you know that a single ValueError can get logged tens of times in different forms; our goal will be to make sure all of them are either not present or encrypted. We’ll be using the following error to simulate an exception raised by a Python function handling sensitive customer information:
Looking at this, we can separate the problem into two parts: the driver and the executors.

The executors
Let's start with what we initially (correctly) perceived to be the main issue. Spark Executors are a fairly straightforward concept until you add Python into the mix. The specifics of what's going on inside are not often talked about but are relevant to the discussion at hand, so let's dive in.
All executors are actually JVMs, not Python interpreters, and are implemented in Scala. Upon receiving Python code that needs to be executed (e.g. in rdd.map), they start a daemon written in Python that is responsible for forking the worker processes and supplying them with a means of talking to the JVM, via sockets. The protocol here is pretty convoluted and very low-level, so we won't go into too much depth. What will be relevant to us are two details; both have to do with communication between the Python processes and the executor JVM:
1. The JVM executor expects the daemon to open a listening socket on the loopback interface and communicate the port back to it via stdout.
2. The worker code contains a general try-except that catches any errors from the application code and writes the traceback to the socket that's read by the JVM.
Point 2 is how the Python exceptions actually get to the executor logs, which is exactly why we can't just use init_logging, even if we could guarantee that it was called: Python tracebacks are actually logged by Scala code! How is this information useful? Well, you might notice that the daemon controls all Python execution, as it spawns the workers. If we can make it spawn a worker that will encrypt exceptions, our problems are solved. And it turns out Spark has an option that does just that: spark.python.daemon.module. This solution actually works; the problem is that it's incredibly fragile:
- We now have to copy the code of the daemon, which makes Spark version updates difficult.
- Remember, the daemon communicates the port to the JVM via stdout. Anything else written to stdout (say, a warning output by one of the packages used for encryption) destroys the executor:
As you can probably tell by the level of detail here, we really did think we could do the encryption at this level. Disappointed, we went one level up and took a look at how the PythonException was handled in the Scala code. It turns out it's just logged at ERROR level, with the Python traceback received from the worker treated as the message. Spark uses log4j, which provides a number of options to extend it; Spark, additionally, provides the option to override log processing using its configuration. Thus, we can achieve our goal by encrypting the messages of all exceptions at the log4j level. We did this by creating a custom RealEncryptExceptionLayout class that simply calls the default layout unless it gets an exception, in which case it substitutes one with an encrypted message. Here's how it broadly looks:
To make this work we shipped this as a jar to the cluster and, importantly, specified the following configuration:
And voila!

The driver: executor errors by way of Py4J
Satisfied with ourselves, we decided to grep the logs for the error before moving on to errors in the driver. Said moving on was not yet to be, however, as we found the following in the driver's stdout:
Not only is this incredibly readable, it's also not encrypted! This exception, as you can very easily tell, is thrown by the Scala code, specifically the DAGScheduler, when a task set fails, in this case due to repeated task failures. Fortunately for us, as illustrated by the diagram above, the driver simply runs Python code in the interpreter which, as far as it's concerned, just happens to call Py4J APIs that, in turn, communicate with the JVM. Thus, it's not meaningfully different from our backend applications in terms of error handling, so we can simply reuse the init_logging() function. If we do that and check the stdout, we see that it does indeed work:
The driver: executor errors by way of TaskSetManager
Yes, believe it or not, we haven't escaped the shadow of the Executor just yet. We've seen our fair share of the driver's stdout. But what about stderr? Wouldn't any reasonable person expect to see some of those juicy errors there as well? We pride ourselves on being occasionally reasonable, so we did check. And lo and behold:
Turns out there is yet another component that reports errors from the executors: TaskSetManager; our good friend DAGScheduler also logs this error when a stage crashes because of it. Both of them, however, do this while processing events initially originating in the executors; where does the traceback really come from? In a rare flash of logic in our dark journey, from the Executor class, specifically the run method:
Aha, there’s a Serializer here! That’s very promising, we should be able to extend/replace it to encrypt the exception before actual serialization, right? Wrong. In fact, to our dismay, that used to be possible but was removed in version 2.0.0 (reference: https://issues.apache.org/jira/browse/SPARK-12414). Seeing as how nothing is configurable at this end, let’s go back to the TaskSetManager and DAGScheduler and note that the offending tracebacks are logged by both of them. Since we are already manipulating the logging mechanism, why not go further down that lane and encrypt these logs as well? Sure, that’s a possible solution. However, both log lines, as you can see in the snippet, are INFO. To find out that this particular log line contains a Python traceback from an executor we’d have to modify the Layout to parse it. Instead of doing that and risking writing a bad regex (a distinct possibility as some believe a good regex is an animal about as real as a unicorn) we decided to go for a simple and elegant solution. We simply don’t ship the .jar containing the Layout to the driver; like we said, elegant. That turns out to have the following effect:
And that's all that we find in the stderr! Which suits us just fine, as any errors from the driver will be wrapped in Py4J, diligently reported in the stdout and, as we've established, encrypted.

The driver: Python errors
That takes care of the executor errors in the driver. But the driver is nothing to sniff at either. It can fail and log exceptions just as well, can't it? As you have probably already guessed, this isn't really a problem. After all, the driver is just running Python code, and we're already calling init_logging(). Satisfyingly enough, it turns out to work as one would expect. For these errors we again need to look at the driver's stdout. If we raise the exception in code executed in the driver (i.e. the main function), the stdout normally contains:
Calling init_logging() turns this traceback into:
Conclusion
And thus our journey comes to an end. Ultimately our struggle has led us to two realizations; neither is particularly groundbreaking, but both are important to understand when dealing with PySpark:
- Spark is not afraid to repeat itself in the logs, especially when it comes to errors.
- PySpark specifically is written in such a way that the driver and the executors are very different.
Before we say our goodbyes, we feel we must address one question: WHY? Why go through with this and not just abandon this complicated project? Consider that the data our Spark jobs tend to handle is very sensitive; in most cases it consists of various properties of emails sent or received by our customers. If we give up on encrypting the exceptions, we must accept that this very sensitive information could end up in a traceback, at which point it will be propagated by Spark to various log files. The only real way to guarantee no personal data is leaked in this case is to forbid access to the logs altogether. And while we did have to descend into the abyss and come back to achieve error encryption, debugging Spark jobs without access to logs is inviting the abyss inside yourself.
Engineering Team
How We Improved Developer Experience in a Rapidly Growing Engineering Team
By Andy Smith
16 April 2021
Developer experience is one of the most important things for a Head of Engineering to care about. Is development safe and fast? Are developers proud of their work? Are our processes enabling great collaboration and getting the best out of the team? But sometimes, developer experience doesn't get the attention it deserves. It is never the most urgent problem to solve, there are lots of different opinions about how to make improvements, and it seems very hard to measure.

At Tessian the team grows and evolves very quickly; we've gone from 20 developers to over 60 in just 3 years. When the team was smaller, it was straightforward to keep a finger on the pulse of developer experience. With such a large and rapidly growing team, it's all too easy for developer experience to be overshadowed by other priorities. At the end of 2020, it became clear that we needed a way to get a department-wide view of the perception of our developer experience that we could use to inform decisions and see whether those decisions had an impact.

We decided one thing that would really help is a regular survey. This would help us spot patterns quickly and it would give us a way to know if we were improving or getting worse. Most importantly, it gives everyone in the team a chance to have their say and to understand what others are thinking. Borrowing some ideas from Spotify, we sent the survey out in January to the whole Engineering team to get their honest, anonymized feedback. We'll be repeating this quarterly. Here are some of the high-level topics we covered in the survey.

Speed and ease
To better understand if our developers feel they can work quickly and securely, we asked the following questions:
- How simple, safe and painless is it to release your work?
- Do you feel that the speed of development is high?
You can see we got a big spread of answers, with quite a few detractors. We looked into this more deeply and identified that the primary driver for this is that some changes cannot be released independently by developers; some changes have a dependency on other teams and this can slow down development.  We’d heard similar feedback before running the survey which had led us to start migrating from Amazon ECS to Kubernetes. This would allow our Engineering teams to make more changes themselves. It was great to validate this strategy with results from the survey. More feedback called out a lack of test automation in an important component of our system.  We weren’t taking risks here, but we were using up Engineering time unnecessarily. This led to us deciding to commit to a project that would bring automation here. This has already led to us finding issues 15x quicker than before:
Autonomy and satisfaction
We identified two areas of strength revealed by asking the following questions:
- How proud are you of the work you produce and the impact it has for customers?
- How much do you feel your team has a say in what they build and how they build it?
These are two areas that we’ve always worked very hard on because they are so important to us at Tessian. In fact, customer impact and having a say in what is built are the top two reasons that engineers decide to join Tessian.  We’ve recently introduced a Slack channel called #securingthehumanlayer, where our Sales and Customer Success teams share quotes and stories from customers and prospects who have been wowed by their Tessian experience or who have avoided major data breaches (or embarrassing ‘Oh sh*t’ moments!).  We’ve also introduced changes to how OKRs are set, which gives the team much more autonomy over their OKRs and more time to collaborate with other teams when defining OKRs. Recently we launched a new product feature, Misattached File Prevention. Within one hour of enabling this product for our customers, we were able to share an anonymised story of an awesome flag that we’d caught.
What's next?
We're running the survey again soon and are excited to see what we learn and how we can make the developer experience at Tessian as great as possible.
Compliance Tessian Culture Engineering Team
Securing SOC 2 Certification
By Trevor Luker
30 March 2021
Building on our existing ISO 27001 security certification, Tessian is excited to announce that we have achieved Service Organization Control 2 Type 2 (SOC 2) compliance in the key domains of Security, Confidentiality and Availability with zero exceptions on our very first attempt. Achieving full SOC 2 Type 2 compliance within 6 months is simply sensational and is a huge achievement for our company. It reinforces our message to customers and prospects that Information Security and protecting customer data is at the very core of everything Tessian does.
The Journey We began the preparations for SOC 2 in September 2020 and initiated the formal process in October. Having previously experienced the pain and trauma of doing SOC 2 manually, we knew that to move quickly, we needed tooling to assist with the evidence gathering and reporting.  Fortunately we were introduced to VANTA, which automates the majority of the information gathering tasks, allowing the Tessian team to concentrate on identifying and closing any gaps we had. VANTA is a great platform, and we would recommend it to any other company undertaking SOC 2 or ISO 27001 certification. For the external audit part of the process, we were especially fortunate to team up with Barr Advisory who proactively helped us navigate the maze of the Trust Service Criteria requirements. They provided skilled, objective advice and guidance along the way, and we would particularly like to thank Cody Hewell and Kyle Helles for their insights, enthusiasm and support. Tessian chose an accelerated three month observation period, which in turn, put a lot of pressure on internal resources to respond to information requests and deliver process changes as required. The Tessian team knew how important SOC 2 was to us strategically and rallied to the challenge. Despite some extremely short timeframes, we were able to deliver the evidence that the auditors needed.  A huge team effort and a great reflection of Tessian’s Craft At Speed value. What Next? Achieving SOC 2 Type 2 is a crucial step for Tessian as we expand further into the large enterprise space. It’s also the basis on which we will further develop our compliance and risk management initiatives, leading to specialized government security accreditation in the US and Europe over the next year or two.
Tessian Culture Engineering Team
Early adoption: Is Now the Time to Invest in the ‘New Breed’ of Security Products?
By Phil O'Hagan
25 February 2021
There’s an (unfair!) perception in the industry that most CISOs are skeptical, or at least conservative, when it comes to adopting the latest security technology. But the role of the CISO is evolving. It’s no longer to simply “own” risk. Today, they’re also tasked with educating and informing everybody within the company – including the C-Suite – on the risks and what can be done to mitigate them.  In this fast moving world, it’s no longer possible to be passive. Only those who are open-minded (and ideally progressive) will protect their company from the most advanced threats. A year of firsts  The security industry is moving in a different direction. We need only look back at the last 12 months to see why: COVID has raised the profile for security.  A greater attack profile has caught the attention of executive teams, and they are looking to CISOs to respond. But, it’s not all bad news. Just as cybercriminals see opportunity in disruption, CISOs have an opportunity to play a bigger role at the executive level. The digital transformation has been accelerated. The shift to remote working means an increased attack surface. Today, security teams must support whole departments of remote workers as they engage with technology in their kitchens, bedrooms, and coffee shops. CISOs need to do more than send the occasional email or facilitate annual training to raise awareness about cyber threats.  Ransomware is an ever-growing threat. In fact, almost a third of victims pay a ransom, which means the stakes are higher than ever.  Attackers have improved the implementation of their encryption schemes, making them harder to crack. And, rather than simply encrypting critical data, some criminals now steal sensitive data and threaten to release it if the ransom is not paid.  With so much changing, CISOs have to adapt fast and adopt new technology to succeed. Gartner calls this period of early adoption a “hype cycle”.  And, for any new innovation, early publicity produces a number of success stories — often accompanied by scores of failures. Some companies take action; many do not. Where do you stand? The technology balance Both inside and outside of security, there are plenty of arguments both for and against new technologies:
Given the rapidly evolving threat landscape, though, CISOs should be pushed harder than most to commit fully to the leading edge of security innovation. After all, "nobody got fired for buying IBM" and "fortune favors the brave", right?

The next generation of security
More and more CISOs are choosing to be brave. Why? It comes down to the modern way this next generation of security is being designed and built. Today's security offerings are focused on cutting the risk out of early participation while amplifying the benefits. At the heart of the change are two related trends:
- Next-generation security services
- The advancement of machine learning
The next generation of security services has removed the need for CISOs to be experts in negotiating IT projects. Instead, they can focus on managing the risks to their business. For example, with cloud services, the costs of infrastructure, and the effort of supporting it, are completely removed, as the services you buy scale to match the business. Cloud services also require no maintenance or professional assistance beyond an internet browser. The cloud means that the hurdles and expense associated with "trying out something new" are hugely mitigated. And, because these next-gen security services are hosted in the cloud, you'll always have the latest version. There is only one "copy" of these software tools. That means upgrade cycles have come down from once a year to multiple times a day. Better still, these services connect to one another. This equates to a shallower learning curve for users, faster time-to-market, and the flexibility to bolt on future tools that suit the way you want to run your operation.
Legacy technology vs. machine learning Whereas legacy technology uses rule-based techniques to secure organizational risks, new providers leverage machine learning to provide accurate, automatic protection, and visibility against advanced risks, otherwise impossible to detect with legacy systems. Machine learning’s goal is to understand the structure of the data and fit theoretical distributions to that data that is well understood. And, because machine learning often uses an iterative approach to learn from data, the learning can be easily automated. Passes are run through the data until a robust pattern is found. In an ever-evolving security world, this allows for the identification of specific risks. By using machine learning algorithms to build models that uncover these connections, organizations can make better decisions without human intervention. For example, identifying anomalous behaviors that form part of the most advanced threats in the enterprise. The benefit for CISOs – and their security teams – is clear. Lower time commitment to identify and remediate issues and more accurate reporting on the risks to the business. These next generation tools also achieve something legacy systems can’t and don’t: they share de-identified data between customers to ensure everyone is protected, even from threats that haven’t (yet) been seen in their own network. The benefit? Organizations continually – and automatically – improve their protection against an ever-changing threat environment. Low risk, high reward  Finally, like never before – and because these services are in the cloud – security leaders are in a position to switch on new services at low risk, without any upfront investment.  With no upfront CapEx, chances are that your first steps will be below any procurement ceiling too – so PoCs become simple to execute. It becomes rational to test a service or strategy with a small team before rolling out more broadly.  And, because the barrier to try (and switch!) for these early adopters is so low, “try before you buy” is a prevalent trend. With low switching costs, the software developers behind the scenes have a wholehearted commitment to making the trial period compelling enough to convince you to take the next step. They have skin in the game and understand that happy customers dictate whether or not a product is successful. This lowering of barriers, enabling of small-scale testing, and offsetting of cost should all make it a little more tempting for CISOs to take the leap and occasionally try for first-mover status. Because adopting innovative practices has never been so low-risk and the rewards are well-worth it.  To name a few… improving your security posture, reducing admin, and protecting your employees from ever-evolving threats.
Engineering Team
A Solution to HTTP 502 Errors with AWS ALB
By Samson Danziger
03 November 2020
At Tessian, we have many applications that interact with each other using REST APIs. We noticed in the logs that at random times, uncorrelated with traffic, and seemingly unrelated to any code we had actually written, we were getting a lot of HTTP 502 “Bad Gateway” errors. Now that the issue is fixed, I wanted to explain what this error means, how you get it and how to solve it. My hope is that if you’re having to solve this same issue, this article will explain why and what to do.  First, let’s talk about load balancing
In a development system, you usually run one instance of a server and you communicate directly with it. You send HTTP requests to it, it returns responses, everything is golden.  For a production system running at any non-trivial scale, this doesn’t work. Why? Because the amount of traffic going to the server is much greater, and you need it to not fall over even if there are tens of thousands of users.  Typically, servers have a maximum number of connections they can support. If it goes over this number, new people can’t connect, and you have to wait until a new connection is freed up. In the old days, the solution might have been to have a bigger machine, with more resources, and more available connections. Now we use a load balancer to manage connections from the client to multiple instances of the server. The load balancer sits in the middle and routes client requests to any available server that can handle them in a pool.  If one server goes down, traffic is automatically routed to one of the others in the pool. If a new server is added, traffic is automatically routed to that, too. This all happens to reduce load on the others.
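To make the idea concrete, here's a deliberately tiny round-robin balancer sketched in Node/TypeScript. It's a toy (with made-up upstream addresses), not how an AWS ALB works internally:

```ts
import http from "http";

// Toy illustration only: two hypothetical upstream servers in the pool.
const pool = [
  { host: "127.0.0.1", port: 8001 },
  { host: "127.0.0.1", port: 8002 },
];
let next = 0;

const balancer = http.createServer((clientReq, clientRes) => {
  // Pick the next server in the pool, round-robin style.
  const target = pool[next];
  next = (next + 1) % pool.length;

  const upstreamReq = http.request(
    {
      host: target.host,
      port: target.port,
      path: clientReq.url,
      method: clientReq.method,
      headers: clientReq.headers,
    },
    (upstreamRes) => {
      clientRes.writeHead(upstreamRes.statusCode ?? 502, upstreamRes.headers);
      upstreamRes.pipe(clientRes);
    }
  );

  // If the upstream connection fails, the client sees a 502 Bad Gateway
  // (the same class of error this post is about).
  upstreamReq.on("error", () => {
    clientRes.writeHead(502);
    clientRes.end("Bad Gateway");
  });

  clientReq.pipe(upstreamReq);
});

balancer.listen(8080);
```

Note how an upstream failure surfaces to the client as a 502 Bad Gateway, which is exactly the class of error we were seeing.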
What are 502 errors?

On the web, there are a variety of HTTP status codes sent in response to requests to let the user know what happened. Some might be pretty familiar:

• 200 OK – Everything is fine.
• 301 Moved Permanently – I don't have what you're looking for, try here instead.
• 403 Forbidden – I understand what you're looking for, but you're not allowed here.
• 404 Not Found – I can't find whatever you're looking for.
• 503 Service Unavailable – I can't handle the request right now, probably too busy.

4xx and 5xx both deal with errors. 4xx codes are client errors, where the user has done something wrong. 5xx codes, on the other hand, are server errors, where something is wrong on the server and it's not your fault. All of these are specified by a standard called RFC 7231, which says of 502:

The 502 (Bad Gateway) status code indicates that the server, while acting as a gateway or proxy, received an invalid response from an inbound server it accessed while attempting to fulfill the request.

The load balancer sits in the middle, between the client and the actual service you want to talk to. Usually it acts as a dutiful messenger, passing requests and responses back and forth. But if the service returns an invalid or malformed response, instead of returning that nonsensical information to the client, the load balancer sends back a 502 error instead. This lets the client know that the response the load balancer received was invalid.
The actual issue

Adam Crowder has done a full analysis of this problem, tracking it all the way down to TCP packet capture to assess what's going wrong. That's a bit out of scope for this post, but here's a brief summary of what's happening.

At Tessian, we have lots of interconnected services. Some of them have Application Load Balancers (ALBs) managing the connections to them. In order to make an HTTP request, we must open a TCP socket from the client to the server. Opening a socket involves performing a three-way handshake with the server before either side can send any data. Once we've finished sending data, the socket is closed with a four-step process. These three- and four-step processes can be a large overhead when not much actual data is sent.

Instead of opening and then closing one socket per HTTP request, we can keep a socket open for longer and reuse it for multiple HTTP requests. This is called HTTP Keep-Alive. Either the client or the server can then initiate a close of the socket with a FIN segment (either for fun or due to timeout).
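To picture what Keep-Alive buys you on the client side, here is a small hedged sketch using Python's requests library; the host and paths are placeholders. A Session keeps a pool of open TCP connections, so consecutive requests can reuse one socket rather than paying the handshake and teardown cost each time.

# Hedged sketch of client-side HTTP Keep-Alive with the requests library.
# The host and paths are placeholders for the example.
import requests

with requests.Session() as session:
    for path in ("/health", "/users", "/metrics"):
        # All three requests can share a single pooled TCP connection.
        response = session.get("https://api.example.com" + path, timeout=5)
        print(path, response.status_code)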
The 502 Bad Gateway error is caused when the ALB sends a request to a service at the same time that the service closes the connection by sending the FIN segment to the ALB socket. The ALB socket receives FIN, acknowledges, and starts a new handshake procedure. Meanwhile, the socket on the service side has just received a data request referencing the previous (now closed) connection. Because it can’t handle it, it sends an RST segment back to the ALB, and then the ALB returns a 502 to the user. The diagram and table below show what happens between sockets of the ALB and the Server.
How to fix 502 errors

It's fairly simple. Just make sure that the service doesn't send the FIN segment before the ALB sends a FIN segment to the service. In other words, make sure the service doesn't close the HTTP Keep-Alive connection before the ALB does. The default idle timeout for the AWS Application Load Balancer is 60 seconds, so we changed the service timeouts to 65 seconds. Barring two hiccoughs shortly after deploying, this has totally fixed it.

The actual configuration change

I have included the configuration for common Python and Node server frameworks below. If you are using any of those, you can just copy and paste. If not, these should at least point you in the right direction.

uWSGI (Python), as a config file:

# app.ini
[uwsgi]
...
harakiri = 65
add-header = Connection: Keep-Alive
http-keepalive = 1
...

Or as command line arguments:

--add-header "Connection: Keep-Alive" --http-keepalive --harakiri 65

Gunicorn (Python), as a command line argument:

--keep-alive 65

Express (Node), specifying the time in milliseconds on the server object:

const express = require('express');
const app = express();

const server = app.listen(80);
server.keepAliveTimeout = 65000;
Looking for more tips from engineers and other cybersecurity news? Keep up with our blog and follow us on LinkedIn.
Engineering Team
Safely migrating millions of API requests
By Andy Smith
10 March 2020
In December we successfully flipped around half a billion monthly API requests from our Ruby on Rails application to some new Python 3 applications. Now that the dust has settled, and we're comfortable that all has gone well, I wanted to write up the details of the project, give a bit of a history of Engineering at Tessian, and share some lessons learned in the hope that others may benefit from mistakes we've made.

In the beginning, there was Rails

Long before Tessian became what it is today, most of the code base for our backend infrastructure was written in Ruby on Rails. This was the right choice of technology at the time; it allowed us to produce a reliable product while iterating quickly. But as we grew, it became apparent that being able to share production code with our data science team (who predominantly work in Python) would allow us to move much more quickly. That was when we decided to build out some core backend functionality using Python 3, which would let our backend code lean heavily on various open source tooling for data science and machine learning. That decision was made three years ago and today, in hindsight, it still looks like the right call. However, and you may be ahead of me already here, deciding to start using Python did not magically get rid of all the Ruby code we had already written. That was the situation one of our Engineering teams found themselves pondering in August last year.

Deciding to migrate

At Tessian our teams have themes to help define their place in the world. Themes are mini mission statements that ladder up to Tessian's greater mission to secure the human layer. One team's theme is "Tessian's stellar security reputation aids growth". Since the team's inception, it had been focusing on building security features in our Python code, where a lot of backend development and data processing takes place. While debating the most important thing to work on next, we decided to dig into some data. This showed that 500,000 API requests an hour were being handled by our Ruby on Rails application server. Looking at that number, coupled with the fact that we had grown the Engineering team 100% in the past year and hired zero Ruby developers, it quickly became apparent that this part of our code base needed some attention. The following factors ultimately contributed to our decision:

• The proportion of Ruby experts in the company was shrinking.
• Improvements to code linting and security frameworks were being added to our Python code and not to Ruby.
• Our Ruby app had not kept up with improvements we had made to monitoring and alerting.
• Ruby was the original code base and contained some of the oldest and least well understood code in the company.
• There were tickets in our backlog around poor performance of some of the Ruby endpoints, meaning future development of them was likely.

So the decision was made: we would port the existing Ruby APIs to our Python code base, allowing them to make use of our latest frameworks and practices.

Path to migration

Given the high volume of traffic and the importance of the APIs, we wanted to keep risk to a minimum when porting them. The code also came with other challenges, such as poorly defined interfaces and many different client versions. After a few whiteboard sessions, we settled on a phased approach.

Phase 0 – Existing setup

The original setup: clients talking to our Ruby application.
Phase 1 – Transparent proxying

First we built a new Python application to transparently proxy traffic to our Ruby application, and slotted it into the API flow. Because it just proxied traffic, this was a relatively safe operation.
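As a rough illustration of what transparent proxying means here (the framework choice, route names and upstream URL are assumptions for the sketch, not our actual stack), a catch-all route can forward each incoming request to the legacy application and relay its response unchanged:

# Illustrative transparent proxy: forward every request to the legacy app
# and return its response untouched. Flask/requests and the upstream URL
# are assumptions for this sketch.
import requests
from flask import Flask, Response, request

app = Flask(__name__)
LEGACY_BASE_URL = "http://legacy-ruby-app.internal"  # hypothetical upstream

@app.route("/", defaults={"path": ""}, methods=["GET", "POST", "PUT", "DELETE"])
@app.route("/<path:path>", methods=["GET", "POST", "PUT", "DELETE"])
def proxy(path):
    upstream = requests.request(
        method=request.method,
        url=f"{LEGACY_BASE_URL}/{path}",
        headers={k: v for k, v in request.headers if k.lower() != "host"},
        params=request.args,
        data=request.get_data(),
        timeout=30,
    )
    # Relay status code, body and headers back to the original caller.
    excluded = {"content-encoding", "transfer-encoding", "connection", "content-length"}
    headers = [(k, v) for k, v in upstream.headers.items() if k.lower() not in excluded]
    return Response(upstream.content, upstream.status_code, headers)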
Phase 2 – Response generation and comparison

The next step (which we did for each API that we migrated) was to implement the API in Python and use live production data to compare "what we would have sent", generated by the Python app (new_response), with "what we should have and did send" (old_response). By comparing the responses, we could catch errors in the implementation based on live production data and fix them. Note that this was not perfect; most of our APIs mutated state in a database in a way that was not idempotent. This meant that we never wanted both Ruby and Python to affect the database – it was either one or the other.
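Here is a hedged sketch of that comparison step (the function and client names are hypothetical): compute the Python response alongside the Ruby one, log any mismatch for later investigation, and always return the Ruby response so the new code has no user-visible effect yet.

# Illustrative shadow comparison: the Ruby response is what the client gets;
# the Python response is only generated, compared and logged. Names are
# hypothetical, not the actual Tessian framework.
import logging

logger = logging.getLogger("migration.compare")

def handle_request(request, ruby_client, python_handler):
    old_response = ruby_client.forward(request)   # what we should have and did send
    try:
        new_response = python_handler(request)    # what we would have sent
        if (new_response.status_code, new_response.body) != (old_response.status_code, old_response.body):
            logger.warning(
                "Mismatch for %s: old=%s new=%s",
                request.path, old_response.status_code, new_response.status_code,
            )
    except Exception:
        # The new implementation must never break the live path.
        logger.exception("Python handler raised for %s", request.path)
    return old_response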
Phase 3 – Switchover

Once we had confidence that the response being returned was correct, it was time to stop using Ruby to affect the database and start using Python. Because we did not want conflicts between Ruby and Python both altering database state, we stopped calling Ruby as we switched over. As mentioned above, this path had not been tested in production, so it still carried some risk. We therefore switched over in stages, routing 10% of traffic first, then 50%, then 100%, and focusing first on our internal dog food tenancy.
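One way to picture the staged rollout (the bucketing scheme and names below are assumptions for illustration, not necessarily how we did it): hash each tenant into a bucket from 0 to 99 so a given tenant is always routed consistently, then raise the threshold from 10 to 50 to 100, with the internal dog food tenancy always on the new path.

# Illustrative percentage-based switchover. Bucketing by tenant ID keeps a
# tenant's traffic on one backend; the names and rollout values are made up.
import hashlib

ROLLOUT_PERCENT = 10                       # raise to 50, then 100
ALWAYS_ON_PYTHON = {"internal-dogfood"}    # our own tenancy goes first

def use_python_backend(tenant_id: str) -> bool:
    if tenant_id in ALWAYS_ON_PYTHON:
        return True
    digest = hashlib.sha256(tenant_id.encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < ROLLOUT_PERCENT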
And that was it! Once we had done this for all APIs, the amount of Ruby code in production was drastically reduced.

Retrospective

At Tessian we believe that we will build the best teams and product by being open about when things go wrong; we also believe in creating a blameless culture. We suffered one incident as a result of this migration. It had a very limited impact on customers, causing a minor degradation to our web UI only – not our predictions. We were able to retroactively fix the symptoms, but this was something that we took seriously. The issue was that one of our new Python APIs did not update a database column that the Ruby APIs previously did. In hindsight, we think that our comparison framework gave us a false sense of security in the code porting. Our key takeaway was that next time we will have to be more conscious of what it would catch (us breaking our APIs) and what it would not (incorrect database updates). If we were to do it again, we would do it mostly the same way, but with more thorough code review on these components. All in all, we consider the project to be a success and believe that it will aid Tessian's stellar security reputation, thanks to the amazing hard work of all the engineers who worked on this project!
Tessian Culture Engineering Team
Data Science at Tessian is all about Passion And Curiosity
13 September 2019
Tessian Data Scientists design the algorithms that are at the heart of what we do. We wouldn't exist without our machine learning models, and it's what our clients rely on day-to-day. But what does this mean in practice? Companies can leverage data science in a number of ways, and we think the role of a Data Scientist falls into three distinct categories:

1. You work for a business function, analyzing and reporting on how to improve a key metric, e.g. increasing user conversion.
2. You are responsible for writing production models to enhance the main product, e.g. recommendation systems for e-commerce.
3. You build machine learning models, which are the product the company sells.

First, we love email

More specifically, we love massive enterprise email datasets. Email doesn't have the best reputation with engineers: the protocol is ancient, poorly defined, even more poorly secured, and email isn't Slack. As a Data Science team we don't think of email in terms of SMTP, but rather as a beautiful, dynamic and pretty huge JSON dataset that captures the intricacies of human-to-human communication. Email knows who you communicate with, what you communicate about, which clients you're pitching, which projects you've just completed, who your team members are, your company hierarchy, (excitingly) the list goes on.
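To make "email as a JSON dataset" concrete, here is a purely illustrative, heavily simplified record of the kind of structured data an email can be viewed as; none of these field names are Tessian's actual schema.

# Purely illustrative: an email viewed as structured data rather than SMTP.
# Field names are invented for the example, not a real schema.
email_record = {
    "from": "alice@acme.com",
    "to": ["bob@globex.com"],
    "cc": [],
    "subject": "Q3 pitch deck",
    "sent_at": "2019-09-13T09:41:00Z",
    "attachments": [{"name": "deck.pdf", "size_bytes": 482113}],
    "thread_depth": 4,
    "historical_recipients": ["bob@globex.com", "carol@globex.com"],
}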
Of course, all this information is hidden, messy and unstructured

But that's where we come in. Using machine learning and NLP, we build bespoke anomaly detection models to prevent threats executable by humans (rather than code) and to secure our clients' communications. We also care deeply about the end user experience of our products, which sounds obvious, but is much more difficult (and, in our opinion, important) when machine learning is involved, due to its black box nature. This means we spend a lot of time focusing on the explainability of our machine learning predictions back to the end user. For example, why does this email look misdirected? Why does this email you're receiving look malicious? Notifications with context are more effective than lazy boilerplate warnings (the industry standard).

Another exciting part of being a Data Scientist at Tessian is that we are always thinking about future products we should be building. The great thing about email data is that it's not "opinionated" about the problem we are trying to solve, meaning we can experiment with solving different problems using the exact same dataset. Sometimes this involves trying to find signal in the noise, like when we discovered that strong-form spear phishing impersonation attacks were getting past existing defenses. Other times it involves trying to solve specific threats and problems brought to us by our clients. The highlight of my week is the Data Science brainstorming session, where we discuss all of our ideas for new products and current product improvements, as well as any new papers or tools that we've read about that might help us further our research and models.
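As a toy illustration of explainable anomaly scoring (emphatically not Tessian's actual models), a recipient check might compare an address against the sender's historical recipients and return a human-readable reason alongside the score:

# Toy illustration only: flag a recipient the sender has rarely emailed and
# attach an explanation for the end user. Not Tessian's actual model.
from collections import Counter

def recipient_anomaly(sender_history: Counter, recipient: str):
    total = sum(sender_history.values()) or 1
    frequency = sender_history.get(recipient, 0) / total
    score = 1.0 - frequency  # rarer recipient -> higher anomaly score
    reason = (
        f"You have emailed {recipient} {sender_history.get(recipient, 0)} times "
        f"out of your last {total} emails."
    )
    return score, reason

history = Counter({"bob@globex.com": 120, "carol@globex.com": 3})
print(recipient_anomaly(history, "bob@g1obex.com"))  # lookalike domain scores high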
One thing I've touched on a lot, but not specifically discussed, is data

And that's why at Tessian it's impossible to talk about Data Science without talking about Engineering. To train our machine learning algorithms we need lots of data, and to deploy and run our models in production, we need this data processed with minimal latency. Our Data Science team owns the data and what we do with it, but to process, store and scale this data, we call in the Engineering teams to help.

How Data Science and Engineering work together is a much discussed topic, and one for which I believe there is no out-of-the-box solution. We're still figuring it out and tweaking our process, but currently we follow a model similar to Jeff Magnusson's at Stitch Fix, explained in his post Engineers Shouldn't Write ETL. The platform and Engineering teams leverage their domain knowledge to build and expose data frameworks, empowering our Data Scientists to have end-to-end ownership of their work. This has the added benefit of freeing up Engineering teams to focus on building and scaling our core APIs, rather than maintaining fiddly data pipelines.

We're a friendly bunch of ambitious Engineers building breakthrough machine learning and natural language technologies to analyze, understand and protect enterprise communication networks.
Engineering Team
Introducing Catapult: Tessian’s Very Own Release Tool
30 June 2019
Today we're excited to open source our internal release tool, Catapult. At Tessian we run our CI/CD pipelines from Concourse. (Like many, we picked Concourse because it's not Jenkins*, but we'll save that for another blog post.) Although Concourse is a fantastic build tool that cures a lot of headaches for us, as its creators will readily admit, it is not necessarily a tool with the most advanced security setup. As a company that deals with some of the world's most sensitive data, this was not good enough for us. We wanted a release tool with security features like two-factor authentication and an audit trail that we had come to expect from other tools we use day to day. At Tessian we also empower our development teams to release and maintain their own services, so we wanted a system that allowed for permissioning.

After some head scratching, it became apparent that we didn't need to reinvent the wheel. By driving our releases from files stored in S3 and making use of Concourse resources, we could meet all of our requirements and more. This was our list of demands:

• Fine-grained permissioning
• An extensive audit trail
• Flexibility
• Two-factor authentication
• High speed and high availability
• Usability

So what exactly is Catapult?

Catapult is two things:

• a command line tool that manages state in S3
• a Concourse resource that consumes said S3 bucket

The permissioning is all managed on the AWS side and left as an exercise for the reader.

Command line

[Screenshot: the catapult command showing a new release]

In the background this is doing a number of checks, looking at S3, git and our Docker repository. Assuming the user has the correct permissions, this will update a file in S3, which our Catapult Concourse resource is monitoring.

Concourse resource

When the resource discovers a new version of the file, it will download it, create a new version of the Concourse resource, display all the above metadata, and, assuming it is set up to do so, trigger a new task. From here you can do whatever you want with the version managed in Concourse.

What next?

We think there's plenty of work left to do on Catapult, but we wanted to share what we've built so far with the world. We're very keen to hear feedback, so please send us a pull request or issue on GitHub!

*We think The New Stack gives a nice summary of some of the issues we've had with Jenkins in past lives: https://thenewstack.io/many-problems-jenkins-continuous-delivery/
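To make the "state in S3" idea above concrete, here is a hedged sketch of a release manifest being written to a bucket for a Concourse resource to poll; the bucket name, key layout and fields are invented for the example and are not Catapult's actual format.

# Hedged sketch of release state in S3: write a small manifest that a
# Concourse resource could poll. Bucket, key and fields are hypothetical.
import json
from datetime import datetime, timezone

import boto3

def publish_release(bucket: str, service: str, version: str, released_by: str):
    manifest = {
        "service": service,
        "version": version,
        "released_by": released_by,
        "released_at": datetime.now(timezone.utc).isoformat(),
    }
    s3 = boto3.client("s3")
    # The resource watches this key; a new object version signals a new release.
    s3.put_object(
        Bucket=bucket,
        Key=f"releases/{service}.json",
        Body=json.dumps(manifest).encode(),
    )

publish_release("my-release-bucket", "email-gateway", "1.4.2", "andy")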