
Engineering Team
Why Confidence Matters: Experimental Design
by Cassie Quek Wednesday, January 19th, 2022
This post is part three of Why Confidence Matters, a series about how we improved Defender's confidence score to unlock a number of important features. You can read part one here and part two here.

Bringing our series to a close, we explore the technical design of our research pipeline, which enabled our Data Scientists to iterate over models with speed. We aim to provide insight into how we solved issues pertaining to our particular dataset, and conclude with the impact this project had on our customers and product.

Why design a pipeline?

Many people think that a Data Scientist's job is like a Kaggle competition – you throw some data at a model, get the highest scores, and boom, you're done! In reality, building a product such as Tessian Defender was never going to be a one-off job. The challenges of making a useful machine learning (ML) model in production lie not only in its predictive power, but also in its speed of iteration, reproducibility, and ease of future improvements.

At Tessian, our Data Scientists oversee projects end-to-end, from conception and design all the way through to deployment in production, monitoring, and maintenance. Hence, our team started by outlining the longer-term requirements above, then sat down with our Engineers to design a research flow that would fulfill these objectives.

Here's how we achieved the requirements we set out.
The research pipeline
The diagram above shows the design of the pipeline with its individual steps, starting from the top left. An overall configuration file specifies the parameters for the pipeline, such as the date range of the email data we'll be using and the features we'll compute. The research pipeline is then run on Amazon SageMaker, and takes care of everything from ingesting the checked email data from S3 (the Collect Logs step) to training and evaluating the model (at the bottom of the diagram).

Because the pipeline is split into independent and configurable "steps", each storing its output before the next picks it up, we were able to iterate quickly. This gave us the flexibility to configure and re-run the pipeline from any step without having to re-run all the previous steps, which allowed for experimentation at speed.

In our experience, we only had to revise the slowest data collection and processing steps (steps 1-3) a couple of times to get them right; most of the work and improvements involved experimenting with the feature and model training steps (steps 4-5). The later research steps take only a few minutes to run, as opposed to hours for the earlier steps, and let us test features and obtain answers about them quickly.
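The "independent, configurable steps" idea can be sketched as a small step runner that caches each step's output, so a re-run can start from any step. This is a hypothetical illustration (the names and caching scheme are ours, not Tessian's actual implementation):

```python
import json
import tempfile
from pathlib import Path

def run_step(name, fn, inputs, cache_dir, force=False):
    """Run `fn(inputs)` unless a cached output for `name` already exists."""
    out_path = Path(cache_dir) / f"{name}.json"
    if out_path.exists() and not force:
        return json.loads(out_path.read_text())  # reuse cached output, skip recompute
    result = fn(inputs)
    out_path.write_text(json.dumps(result))      # persist so later steps can re-run alone
    return result

# Two toy steps standing in for "Collect Logs" and "Features".
def collect_logs(_):
    return {"emails": [{"id": 1, "label": "phish"}, {"id": 2, "label": "safe"}]}

def compute_features(logs):
    return {"n_emails": len(logs["emails"])}

cache = tempfile.mkdtemp()
logs = run_step("collect_logs", collect_logs, None, cache)
feats = run_step("features", compute_features, logs, cache)
```

Re-running with `force=True` on only the features step would recompute features from the cached logs, which is the property that makes the slow early steps cheap to skip.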
Five Key Steps within the Pipeline

Some of these will be familiar to any Data Science practitioner. We'll leave out general descriptions of these well-known ML steps, and instead focus on the specific adjustments we made to ensure the confidence model worked well for the product.

1. Collect Logs

This step collects all email logs with user responses from S3 and transforms them into a format suitable for later use, stored separately per customer. These logs contain information on decisions made by Tessian Defender, using the data available at the time of the check. We also look up and store additional information at this stage to enrich and add context to the dataset.

2. Split Data

The way we choose to create the training and test datasets is very important to the model outcome. As mentioned before, consistency in model performance across different cuts of the data is a major concern and success criterion.

In designing our cross-validation strategy, we utilized both a time-period hold-out and a tenant hold-out. The time-period hold-out allows us to confirm that the model generalizes well across time even as the threat landscape changes, while testing on a tenant hold-out ensures the model generalizes well across all our customers, who are spread across industries and geographical regions. Having this consistency means that we can confidently onboard new tenants and maintain similar predictive power of Tessian Defender on their email traffic.

However, the downside of having multiple hold-outs is that we effectively throw out data that does not fit within both constraints for each dataset, as demonstrated in the chart below.
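The combined hold-out can be sketched as follows (a hypothetical illustration with made-up tenants and dates; Tessian's actual split logic is more involved). An email lands in the training set only if it is from the training period and a training tenant, in the test set only if it is from the held-out period and a held-out tenant, and is discarded if it satisfies just one of the two constraints:

```python
from datetime import date

def split_emails(emails, test_start, test_tenants):
    """Split emails by a time-period hold-out AND a tenant hold-out."""
    train, test, discarded = [], [], []
    for e in emails:
        in_test_period = e["date"] >= test_start
        in_test_tenant = e["tenant"] in test_tenants
        if not in_test_period and not in_test_tenant:
            train.append(e)        # earlier period, training tenant
        elif in_test_period and in_test_tenant:
            test.append(e)         # held-out period, held-out tenant
        else:
            discarded.append(e)    # satisfies only one constraint: thrown away
    return train, test, discarded

emails = [
    {"tenant": "acme",   "date": date(2021, 3, 1)},  # train
    {"tenant": "acme",   "date": date(2021, 9, 1)},  # train tenant, test period -> discarded
    {"tenant": "globex", "date": date(2021, 9, 1)},  # held-out tenant, test period -> test
    {"tenant": "globex", "date": date(2021, 3, 1)},  # held-out tenant, train period -> discarded
]
train, test, discarded = split_emails(emails, date(2021, 7, 1), {"globex"})
```

The discarded pile is exactly the cost discussed here: half the toy emails fit only one of the two constraints.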
We eventually compromised by allowing a slight overlap between train and validation tenants (but not test tenants), minimizing the discarded data where possible.

3. Labels Aggregation

In part two, we highlighted that one of the challenges of the user-response dataset is mislabelled data. Greymail and spam are often wrongly labeled as phishing, which can cause the undesired effect of the model prioritizing spam, making the confidence score less meaningful for admins. Users also often disagree on whether the same email is safe or malicious. This step takes care of these concerns by cleaning out spam and aggregating the labels.

To assess the quality of user feedback, we first estimated the degree of agreement between user labels and security-expert labels using a sample of emails, and found that they matched in around 85% of cases. We addressed the most systematic bias observed in this exercise by developing a few simple heuristics to correct cases where users reported spam emails as malicious.

Where we have different labels for copies of the same email sent to multiple users, we apply an aggregation formula to derive a final label for the group. This formula is configurable, and was carefully assessed to provide the most accurate labels.

4. Features

This step is where most of the research took place – trialing new feature ideas and iterating on them based on feature analysis and metrics from the final step.

The feature computation actually consists of two independently configurable steps: one for batch features and another for individually computed features. The batch features consist of natural language processing (NLP) vectorizations, which are computed faster as a batch and were more or less static after initial configuration. Splitting them out simplified the structure and maximized our flexibility.
Other features based on stateful values (dependent on the time of the check), such as domain reputations and information from external datasets, were computed or extracted individually – for example, whether any of the URL domains in the email was registered recently.

5. Model Training and Evaluation

In the final and arguably most exciting step of the pipeline, the model is created and evaluated.

Here, we configure the model type and its various hyperparameters before training the model. Then, based on the validation data, the "bucket" thresholds are defined. As mentioned in part two, we defined five confidence buckets, ranging in priority from Very Low to Very High, that simplified communication with users and stakeholders. In addition, this step produces the key metrics we use to compare models against each of the data splits. These include both generic ML metrics and Tessian Defender product-specific metrics, as described in part two.

Using MLflow, we can keep track of the results of our experiments neatly, logging the hyperparameters and metrics, and even storing certain artifacts that would be relevant if we needed to reproduce the model. The interface allows us to easily compare models based on their metrics.

Our team held a weekly review meeting to discuss what we had tried and the metrics it produced, before agreeing on the next steps and experiments to try. We found this practice very effective, as the Data Science team rallied together to meet a deadline each week, and product managers could easily keep track of the project's progress. During this process, we also kept in close contact with several beta users to gather quick feedback on the work-in-progress models, ensuring that the product was being developed with their needs in mind.
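Defining a bucket threshold from validation data might look something like the sketch below (illustrative only: the function, the scores, and the 0.75 precision target are our own stand-ins, not Tessian's real values). The idea is to pick the lowest score threshold at which the emails scoring at or above it meet a target precision:

```python
def very_high_threshold(scored, target_precision):
    """scored: list of (confidence_score, is_malicious) pairs from validation data.
    Returns the lowest threshold whose at-or-above precision meets the target,
    or None if no threshold qualifies."""
    for t in sorted({s for s, _ in scored}):
        above = [label for s, label in scored if s >= t]
        precision = sum(above) / len(above)   # fraction of flagged that are malicious
        if precision >= target_precision:
            return t
    return None

validation = [(0.2, 0), (0.4, 0), (0.6, 1), (0.7, 0), (0.8, 1), (0.9, 1), (0.95, 1)]
t = very_high_threshold(validation, 0.75)
```

Everything scoring at or above the returned threshold would land in the "Very High" bucket; the same procedure with different targets could carve out the remaining buckets.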
The improved confidence score

The new priority model was only deployed once we hit the success criteria we set out to meet.

As set out in part two, besides the many metrics such as AUC-ROC that we tracked internally to give us direction and compare models, our main goal was always to optimize the user experience. That meant the success criteria depended on product-centric metrics: the precision and number of quarantined emails for a client, the rate at which we could improve overall warning precision, and the consistency of performance across different slices of data (time, tenants, threat types).

On the unseen test data, our newest priority model more than doubled the precision of our highest priority bucket. This greatly improved the user experience of Tessian Defender: a security admin can now find malicious emails more easily and act on them more quickly, and quarantining emails without compromising users' workflows became a possibility.
Product Impact

As a Data Scientist working on a live product like Tessian Defender, rolling out a new model is always the most exciting part of the process. We get to observe the product impact of the model instantly, and get feedback through the monitoring we have in place or by speaking directly with Defender customers.

As a result of the improved precision in the highest priority bucket, we unlocked the ability to quarantine with confidence. We are assured that the model can quarantine a significant number of threats for all clients at a low rate of false positives, massively reducing risk exposure for the company and saving employees precious time, as well as the burden and responsibility of discerning malicious mails.

We also understand that not all false positives are equal – for example, accidentally quarantining a safe newsletter has almost zero impact compared to quarantining an urgent legal document that requires immediate attention. Therefore, prior to roll-out, our team also made inquiries to quantify this inconvenience factor, ensuring that the risk of quarantining a highly important, time-sensitive email was very low. All of this meant that the benefit of turning on auto-quarantine and protecting the user from a threat far outweighed the risk of interrupting the user's workflow and any vital business operations.
With this new model, Tessian Defender-triggered events are also being sorted more effectively.

Admins who log in to the Tessian portal will find the most likely malicious threats at the top, allowing them to act upon the threats instantly. Admins can quickly review the suspicious elements highlighted by Tessian Defender and gain valuable insights about the email, such as:

- its origin
- how often the sender has communicated with the organization's users
- how users have responded to the warning

They can then take action, such as removing the email from all users' inboxes or adding the sender to a denylist. Thus, even in a small team, security administrators can respond effectively to external threats, even in the face of a large number of malicious mails, all the while continuing to educate users in the moment on any phishy-looking emails.
Lastly, with the more robust confidence model, we are able to improve the accuracy of our warnings. By ensuring high warning precision overall, users pay attention to every individual suspicious event, reap the full benefits of the in-situ training, and are more likely to pause and evaluate the trustworthiness of an email. As the improved confidence model provides a more reliable estimate of the likelihood that an email is malicious, we can cut back on warning about less phishy emails that a user would learn little from.

This concludes our 3-part series on Why Confidence Matters. Thank you for reading! We hope that this series has given you some insight into how we work here at Tessian, and the types of problems we try to solve. To us, software and feature development is more than just endless coding and optimizing metrics in vain – we want to develop products that actually solve people's problems. If this work sounds interesting to you, we'd love for like-minded Data Scientists and Developers to join us on our mission to secure the Human Layer! Check out our open roles and apply today.

(Co-authored by Gabriel Goulet-Langlois and Cassie Quek)
Engineering Team, Life at Tessian
Engineering Spotlight: Meet Our 2021 Cohort of Associate Engineers
Monday, January 17th, 2022
We’ve believed for a long time that without finding ways to bring new talent into our industry, we’ll never overcome the lack of diversity in tech. But this only works if you can bring in diverse groups of people to begin with.
So, how did we aim to tackle this? Last year, as part of our Diversity, Equity, and Inclusion (DEI) roadmap, we kicked off a recruitment process for five new, entry-level Associate Engineer positions. To widen the pool of talent, we removed some of the historical prerequisites you often see, like 'Must have a degree in Computer Science', and instead added 'code-campers and career-changers welcome' to encourage more potentially great engineers to seize the opportunity.

The process represented a couple of firsts for us:

- This was the first time we mass-recruited and onboarded five candidates into the same role.
- We reviewed over 900 applicants, took over 300 through to the first stage, and one-on-one interviewed 53 candidates over the course of 3 weeks. Talk about Craft at Speed.

We had the opportunity to connect with so many awesome engineers and are really excited to introduce you to the 5 Tessians who officially joined us at the end of 2021.

As you'd expect, every person has a different story to tell… Meet the team:
Nash

Nash has not one but two degrees under his belt. First he achieved a PhD in Cinema History before going on to get his MSc in Computing at Cardiff University. If that wasn't enough, before that he spent two years teaching English in Japan.

Why Tessian?

"The role was much too attractive not to apply to! From the statements about the work culture, to the blogs and podcasts about the company and its mission, to the clear and impactful use cases of the product, it felt like an incredible place to start a new career. I especially loved the 'Engineering at Tessian' YouTube video – it really helped clarify what to expect from life in the company as a part of the engineering team."

What's the coolest thing you've done in your first month?

"While there have been lots of great moments, from Fernet Fridays to team lunches to the thoughtful and well-paced onboarding week, I would say my highlight was the first WIG (weekly interdepartmental gathering) meeting. It was great to share a room – both physically and virtually – with the whole company, to introduce myself, and hear everyone's fun facts about themselves. I really felt like a part of the Tessian community."
Dhruv

Dhruv moved to the UK from New Delhi, India to complete his Computer Science degree at the University of Manchester, before moving to London to join Tessian. Although he enjoyed his time in Manchester, he loves exploring the parks and restaurants of London, as well as catching some live cricket action.

Why Tessian?

"Two things. One, because of the unique products they offer and the cutting-edge technology that goes into building them. I have a keen interest in Software Development, Machine Learning, and Natural Language Processing. Tessian effectively uses these technologies to make emails safer! And two, I feel aligned to the values, and some of the benefits stood out – Refreshian Summer, Taste of Tessian (lunch paid for every Friday), private healthcare, and ClassPass, among other things."

What's the coolest thing you've done in your first month?

"Definitely the WIG. The most fun and terrifying thing so far was introducing myself in front of the whole company and telling everyone a fun fact about myself. My fun fact was that I partly decided to go to the University of Manchester because I support Manchester United. To avoid spending all my money on tickets, I started working as a steward in the Theatre of Dreams and got paid to watch the games! This was an awesome experience that really helped me build my confidence, and I got to hear some really funny stories about my colleagues."
Rahul

Rahul currently commutes from Essex to our office in Liverpool Street. Before this, he achieved an Engineering (Information and Computer Engineering) degree at the University of Cambridge.

Why Tessian?

"After connecting with Tessian, I very quickly became interested in the products and realized how essential email security really is. I'm glad I applied. From start to finish, it was probably the fastest and most efficient of the companies I applied to. Everyone was very friendly and it made me even more eager to join the team."

What's the coolest thing you've done in your first month?

"At the end of the first week, we had an Engineers' social at the office. It was also the last Friday of Refreshian Summer, so the social started at lunch with pizza and drinks. Time flew by and the social went well into the evening. It was a chance to get to know a lot more people in a very relaxed way."
Claire

Not only has Claire moved countries (from Colorado to London) but she's also made a career change. Talk about big moves! Before coming to Tessian, Claire was a project manager at a construction firm. Although she's now switched to a more technical role, if you ever need advice on how much your house foundation will cost, or whether your plumber is indeed making fun of you behind your back, she's got your back.

Why Tessian?

"I was looking for a career change. My goal was to become a software engineer, and I'm particularly interested in cybersecurity and data privacy. I had to move here for the role and I came to London not knowing anyone, so it's been great to enjoy spending time with coworkers on and off the clock. (Another plus: I've become a big fan of the pint and pie deal at my local pub.)"

What's the coolest thing you've done in your first month?

"I'm looking forward to continuing to learn in a supportive environment. My manager says 'We create an environment where people feel supported to tackle hard projects', and I feel like that couldn't be truer. I can't emphasize enough that working here is truly amazing. I am also incredibly excited to connect with other women in STEM and want to become more involved in Tessian's empowering culture!"

Want to get a better idea of what Claire is working on? Check out her Day in the Life post here.
Nicholas

From Switzerland to the UK, Nicholas studied Computer Science, earning his BSc at Exeter University before completing his Masters degree at St. Andrews. From Scotland, he has now joined us in London.

Why Tessian?

"I came for the tech, and stayed for the product. When I applied, I was already pretty familiar with the languages, tools, and platforms Tessian uses. I hadn't given email security much thought, though. But when I started to look into exactly what Tessian did, I gradually became a lot more interested in what they were building. I've seen misdirected emails and spear phishing attempts, and I liked what they were doing to prevent them."

What's the coolest thing you've done in your first month?

"Shortly after onboarding I got to start making changes and additions to our product. These changes were then swiftly deployed to our customers, and it was nice to see how quickly I could start working with the team to make a better product. Our team just released a new product, Architect. I look forward to working on it and making it into the best damn email filtering tool out there. Also, I'm enjoying spending time with July, Claire's dog, who hangs out in the office."
Great news! After a successful cohort in 2021, we have another five entry-level positions available this year. Plus, we have plenty more opportunities for you to join Tessian, in Engineering and across our other teams. Apply now.
Engineering Team, ATO/BEC, Life at Tessian
Why Confidence Matters: How Good is Tessian Defender’s Scoring Model?
Monday, January 10th, 2022
This post is part two of Why Confidence Matters, a series about how we improved Defender's confidence score to unlock a number of important features. You can read part one here.

In this part, we will focus on how we measured the quality of the confidence scores generated by Tessian Defender. As we'll explain later, a key consideration when deciding on metrics and setting objectives for our research was a strong focus on product outcomes.

Part 2.1 – Confidence score fundamentals

Before we jump into the particular metrics and objectives we used for the project, it's useful to discuss the fundamental attributes that constitute a good scoring model.

1. Discriminatory power

The discriminatory power of a score tells us how good the score is at separating positive (i.e. phishy) examples from negative (i.e. safe) ones. The chart below illustrates this idea.

For each of two models, the image shows a histogram of the model's predicted scores on a sample of safe and phishing emails, where 0 means very sure the email is safe and 1 means absolutely certain the email is phishing.

While both models are generally likely to assign a higher score to a phishing email than to a safe one, the example on the left shows a clearer distinction between the most likely scores for phishing vs safe emails.
 
Discriminatory power is very important in the context of phishing because it determines how well we can differentiate between phishing and safe emails, providing a meaningful ranking of flags from most to least likely to be malicious. This confidence also unlocks the ability for Tessian Defender to quarantine emails which are very likely to be phishing, and to reduce flagging on emails we are least confident about, improving the precision of our warnings.
2. Calibration

Calibration is another important attribute of the confidence score. A well-calibrated score reliably reflects the probability that a sample is positive. Calibration is normally assessed using a calibration curve, which plots the precision on unseen samples across different confidence scores (see below).
The above graph shows two example calibration curves. The gray line shows what a perfectly calibrated model would look like: the confidence score predicted for samples (x-axis) always matches the observed proportion of phishy emails (y-axis) at that score. In contrast, the poorly-calibrated red line shows a model that is underconfident at lower scores (the model predicts a lower score than the observed precision) and overconfident at high scores.

From the end-user's perspective, calibration makes the score interpretable, and it matters especially if the score will be exposed to the user.
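A calibration curve like the one described can be computed with a simple binning procedure (a minimal sketch of the standard technique, similar in spirit to scikit-learn's `calibration_curve`; the toy scores below are illustrative):

```python
def calibration_curve(scores, labels, n_bins=5):
    """Group predictions into equal-width score bins; for each bin, compare the
    mean predicted score (x) to the observed fraction of positives (y)."""
    bins = [[] for _ in range(n_bins)]
    for s, y in zip(scores, labels):
        idx = min(int(s * n_bins), n_bins - 1)   # which bin this score falls into
        bins[idx].append((s, y))
    curve = []
    for b in bins:
        if b:                                    # skip empty bins
            mean_pred = sum(s for s, _ in b) / len(b)
            frac_pos = sum(y for _, y in b) / len(b)
            curve.append((mean_pred, frac_pos))
    return curve

# A perfectly calibrated model would have frac_pos ≈ mean_pred in every bin.
scores = [0.1, 0.1, 0.9, 0.9, 0.9, 0.9]
labels = [0,   0,   1,   1,   1,   0]
curve = calibration_curve(scores, labels, n_bins=2)
```

Here the high bin predicts 0.9 but only 75% of its emails are phishing, so the model is slightly overconfident at the top of the range.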
3. Consistency

A good score will also generalize well across different cuts of the samples it applies to. In the context of Tessian Defender, we needed a score that would be comparable across different types of phishing: we should expect the scoring to work just as well for an Account Takeover (ATO) as for a Brand Impersonation. We also had to make sure that the score generalized well across different customers, who operate in different industries and send and receive very different types of emails. For example, a financial services firm may receive a phishing email in the form of a spoofed financial newsletter, but such an email would not appear in the inbox of someone working in the healthcare sector.
Metrics

How do we then quantify the above attributes of a good score? This is where metrics come into play – it is important to design appropriate metrics that are technically robust, yet easily understandable and translatable to a positive user experience.

A good metric for capturing the overall discriminatory power of a model is the area under the ROC curve (AUC-ROC), or the average precision of the model at different thresholds, both of which capture the performance of the model across all possible thresholds. Calibration can be measured with metrics that estimate the error between the predicted score and the true probability, such as the Adaptive Calibration Error (ACE).

While these out-of-the-box metrics are commonly used to assess machine learning (ML) models, a few challenges make them hard to use in a business context.

First, they are quite difficult to explain to stakeholders who are not familiar with statistics and ML. For example, the AUC-ROC score doesn't tell most people how well they should expect a model to behave. Second, it's difficult to translate real product requirements into AUC-ROC scores. Even for those who understand these metrics, it's not easy to specify what increase in them would be required to achieve a particular outcome for the product.
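AUC-ROC itself has a simple probabilistic reading that can be computed directly (a hedged illustration of the standard definition, not Defender's actual evaluation code): it is the probability that a randomly chosen positive example scores higher than a randomly chosen negative one.

```python
def auc_roc(scores, labels):
    """AUC-ROC via its pairwise definition: P(score of random positive >
    score of random negative), with ties counting half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# A perfect ranker puts every phishing email above every safe one (AUC = 1.0);
# random scoring gives AUC around 0.5.
scores = [0.9, 0.8, 0.4, 0.3]
labels = [1,   0,   1,   0]
auc = auc_roc(scores, labels)
```

This pairwise view also shows why the metric is hard to explain to non-specialists: it describes ranking quality over hypothetical pairs, not any concrete product outcome.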
Defender product-centric metrics

While we still use AUC-ROC scores within the team and compare models by this metric, the above limitations meant that we also had to design metrics that could be understood by everyone at Tessian and translated directly to a user's experience of the product features.

First, we defined five simpler-to-understand priority buckets (from Very Low to Very High) that were easier to communicate to stakeholders and users. We aimed to be able to quarantine emails in the highest priority bucket, so we calibrated each bucket to the probability of an email being malicious. This makes each bucket intuitive to understand, and translates clearly to our users' experience of the quarantine feature.

For the feature to be effective, we also defined a minimum number of malicious emails to prevent from reaching the inbox, as a percentage of the company's inbound email traffic. Keeping track of this metric prevents us from over-optimizing the precision of the Very High bucket at the expense of capturing most of the malicious emails (recall), which would greatly limit the feature's usefulness.

While good precision in the highest confidence bucket is important, so is accuracy at the lower end of the confidence spectrum. A robust lower-end score allows us to stop warning on emails we are not confident in, unlocking improvements in the overall precision of the Defender algorithm. Hence, we also set targets for accuracy amongst emails in the Very Low/Low buckets.

Finally, to ensure consistency, the success of the project also depended on achieving the above metrics across slices of the data – the scores would have to be good across the different email threat types we detect, and across the different clients who use Tessian Defender.
Part 2.2 – Our Data: Leveraging User Feedback

Having identified the metrics, we can now look at the data we used to train and benchmark our improvements to the confidence score. Having the right data is key to any ML application, and this is particularly difficult for phishing detection. Specifically, most ML applications rely on labelled datasets to learn from.

We found building a labelled dataset of phishing and non-phishing emails especially challenging for a few reasons:
Data challenges

Phishing is a highly imbalanced problem. On the whole, phishing emails are extremely low in volume compared to all the legitimate email traffic of the average user. According to recent statistics, over 300 billion emails are sent and received around the world every day. This means that trying to label emails manually is highly ineffective, like finding a needle in a haystack.

Also, phishing threats and techniques are constantly evolving, such that even thousands of emails labelled today would quickly become obsolete. The datasets we use to train phishing detection models must be constantly updated to reflect new types of attacks.

Email data is also very sensitive by nature. Our clients trust us to process their emails, many of which contain sensitive data, in a very secure manner. For good reasons, this means we control access to email data very strictly, which makes labelling harder.

All these challenges make it quite difficult to collect large amounts of labelled data to train end-to-end ML models to detect phishing.
User feedback and why it's so useful

As you may remember from part one of this series, end-users have the ability to provide feedback on Tessian Defender warnings. We collect thousands of these user responses weekly, providing us with invaluable data about phishing.

User responses help address a number of the challenges mentioned above.

First, they provide a continually updated view of changes in the attack landscape. Unlike a static email dataset labelled at a particular point in time, user-response labels capture information about the latest phishing trends as we collect them, day in and day out. With each iteration of model retraining on the newest user labels, user feedback is automatically incorporated into the product. This creates a positive feedback loop, allowing the product to evolve in response to users' needs.

Relying on end-users to label their own emails also helps alleviate concerns related to data sensitivity and security. In addition, end-users have the most context about the particular emails they receive. Combined with the explanations provided by Tessian warnings, they are more likely to provide accurate feedback.

These benefits neatly address all the challenges above, but user feedback is not without its limitations.

For one, the difference between phishing, spam, and graymail is not always clear to users, causing spam and graymail to often be labelled as malicious. Several recipients of the same email can also disagree on whether it is malicious. Secondly, user feedback may not be a uniform representation of the email threat landscape – we often receive more feedback from some clients, or about certain types of phishing. Neglecting to address this under-representation would result in a model that performs better for some clients than others, something we absolutely need to avoid in order to ensure consistency in the quality of our product for all new and existing clients.
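When several recipients of the same email disagree, their responses have to be reduced to a single label. One plausible aggregation rule (purely illustrative; the real formula is configurable and not public) is a qualified majority vote:

```python
from collections import Counter

def aggregate_labels(labels, malicious_quorum=0.5):
    """labels: list of 'malicious' / 'safe' user responses for copies of one
    email. Returns the aggregated label: 'malicious' only if strictly more
    than `malicious_quorum` of responders flagged it."""
    counts = Counter(labels)
    frac_malicious = counts["malicious"] / len(labels)
    return "malicious" if frac_malicious > malicious_quorum else "safe"

# Two of three recipients flagged it: aggregated as malicious.
verdict = aggregate_labels(["malicious", "safe", "malicious"])
```

Raising the quorum makes the aggregated labels more conservative, which trades recall in the training labels for precision; that trade-off is exactly why such a formula is worth keeping configurable.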
In the last part of the Why Confidence Matters series, we'll discuss how we navigated the above challenges, delve deeper into the technical design of the research pipeline used to build the confidence-scoring model, and explore the impact this has brought to our customers.
(Co-authored by Gabriel Goulet-Langlois and Cassie Quek)
Engineering Team, Integrated Cloud Email Security, ATO/BEC, Life at Tessian
Why Confidence Matters: How We Improved Defender’s Confidence Scores to Fight Phishing Attacks
Tuesday, January 4th, 2022
‘Why Confidence Matters’ is a weekly three-part series. In this first article, we’ll explore why a reliable confidence score is important for our users. In part two, we’ll explain more about how we measured improvements in our scores using responses from our users. And finally, in part three, we’ll go over the pipeline we used to test different approaches and the resulting impact in production.

Part One: Why Confidence Matters

Across many applications of machine learning (ML), being able to quantify the uncertainty associated with the prediction of a model is almost as important as the prediction itself.

Take, for example, chatbots designed to resolve customer support queries. A bot which provides an answer when it is very uncertain about it will likely cause confusion and dissatisfied users. In contrast, a bot that can quantify its own uncertainty, admit it doesn’t understand a question, and ask for clarification is much less likely to generate nonsense messages and frustrate its users.
The importance of quantifying uncertainty

Almost no ML model gets every prediction right every time – there’s always some uncertainty associated with a prediction. For many product features, the cost of errors can be quite high. For example, mislabelling an important email as phishing and quarantining it could result in a customer missing a crucial invoice, and mislabelling a bank transaction as fraudulent could result in an abandoned purchase for an online merchant.

Hence, ML models that make critical decisions need to predict two key pieces of information: (1) the best answer to provide to a user, and (2) a confidence score that quantifies uncertainty about that answer. Quantifying the uncertainty associated with a prediction helps us decide whether any action should be taken, and if so, which one.
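As a toy illustration (ours, not Tessian’s production logic), a model that returns both a prediction and a confidence lets the caller pick a different action per prediction – act, ask for confirmation, or abstain:

```python
def choose_action(prediction, confidence, act_threshold=0.9, ask_threshold=0.5):
    """Decide what to do with a model prediction given its confidence.

    The thresholds are illustrative placeholders; in practice they would be
    tuned on labelled validation data.
    """
    if confidence >= act_threshold:
        return ("act", prediction)    # confident enough to act automatically
    if confidence >= ask_threshold:
        return ("ask", prediction)    # unsure: ask the user to confirm
    return ("abstain", None)          # too uncertain: do nothing

actions = [choose_action("phishing", c) for c in (0.97, 0.60, 0.20)]
```

The same prediction leads to three different behaviors purely because of the confidence attached to it, which is exactly why the score matters as much as the label.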
How does Tessian Defender work?

Every day, Tessian Defender checks millions of emails to prevent phishing and spear phishing attacks. In order to maximise coverage, Defender is made up of multiple machine learning models, each contributing to the detection of a particular type of email threat (see our other posts on phishing, spear phishing, and account takeover).

Each model identifies phishing emails based on signals relevant to the specific type of attack it targets. Then, beyond this primary binary classification task, Defender also generates two key outputs for any email that is identified as potentially malicious by any of the models:

- A confidence score, which is related to the probability that the flagged email is actually a phishing attack. This score is a value between 0 (most likely safe) and 1 (most certainly phishing), which is then broken down into four categories of Priority (from Low to Very High). This score is important for various reasons, which we expand on in the next section.
- An explanation of why Defender flagged the email. This is an integral part of Tessian’s approach to Human Layer Security: we aim not only to detect phishy emails, but also to educate users in the moment so they can continually get better at spotting future phishing emails. In the banner, we aim to concisely explain the type of email attack, as well as why Defender thinks it is suspicious. Users who see these emails can then provide feedback about whether they think the email is indeed malicious or not. Developing explainable AI is a super interesting challenge which probably deserves its own content, so we won’t focus on it in this particular series. Watch this space!
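Returning to the first output: the score-to-Priority breakdown can be pictured as simple bucketing. A sketch with made-up cut-offs (the post doesn’t state the real ones):

```python
def priority(score):
    """Map a confidence score in [0, 1] to one of four Priority bands.

    The cut-off values below are invented for illustration only.
    """
    if score >= 0.95:
        return "Very High"
    if score >= 0.80:
        return "High"
    if score >= 0.50:
        return "Medium"
    return "Low"

labels = [priority(s) for s in (0.99, 0.85, 0.60, 0.10)]
```

Note that the bands are only as meaningful as the underlying score: if the score is poorly calibrated, a “Very High” email may be no more likely to be phishing than a “Medium” one, which is why improving the score itself was the focus of this series.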
Why Confidence Scores Matter

Beyond Defender’s capability to warn on suspicious emails, there were several key product features we wanted to unlock for our customers that could only be built on a robust confidence score.

Email quarantine

Based on the score, Defender first aims to quarantine the highest-priority emails to prevent malicious emails from ever reaching employees’ mailboxes. This not only reduces the company’s risk exposure from an employee potentially interacting with a malicious email; it also removes the burden and responsibility of making a decision from the user, and reduces interruption to their work. Therefore, for malicious emails that we’re most confident about, quarantining is extremely useful. In order for quarantine to work effectively, we must:

- Identify malicious emails with very high precision (i.e. very few false positives). We understand how much our customers rely on email to conduct their business, so we needed to make sure that any important communications still come through to their inboxes unimpeded. This was very important so that Tessian Defender can secure the human layer without security getting in our users’ way.
- Identify a large enough subset of high-confidence emails to quarantine. It would be easy to achieve very high precision by quarantining only a few emails with a very high score (a low recall), but this would greatly limit the impact of quarantine on how many threats we can prevent. To be a useful tool, Defender needs to quarantine a sizable volume of malicious emails.

Both these objectives directly depend on the quality of the confidence score. A good score would allow a large proportion of flags to be quarantined with high precision.
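This trade-off can be made concrete. On a labelled validation set, one way (our own sketch, with toy data, not Tessian’s actual tooling) is to scan candidate thresholds from strictest to loosest and keep the lowest one whose precision still meets the target, which maximizes how many emails qualify for quarantine:

```python
def quarantine_threshold(scored, min_precision=0.99):
    """Lowest score cut-off whose precision still meets the target.

    `scored` is a list of (score, is_phishing) pairs from a labelled
    validation set. Lowering the cut-off flags more emails (higher recall),
    so we keep lowering it until precision would drop below the target.
    Returns None if even the strictest cut-off misses the target.
    """
    best = None
    for threshold in sorted({s for s, _ in scored}, reverse=True):
        flagged = [label for s, label in scored if s >= threshold]
        precision = sum(flagged) / len(flagged)
        if precision >= min_precision:
            best = threshold  # still precise enough; try lowering further
        else:
            break
    return best

toy = [(0.99, True), (0.97, True), (0.90, True), (0.80, False), (0.70, True)]
cutoff = quarantine_threshold(toy, min_precision=1.0)
```

With this toy data the lowest fully-precise cut-off is 0.90: a better score pushes genuinely malicious emails above such a cut-off and safe ones below it, widening the band that can be quarantined safely.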
Prioritizing phishy emails In today’s threat landscape, suspicious emails come into inboxes in large volumes, with varying levels of importance. That means it’s critical to provide security admins who review these flagged emails with a meaningful way to order and prioritize the ones that they need to act upon. A good score will provide a useful ranking of these emails, from most to least likely to be malicious, ensuring that an admin’s limited time is focused on mitigating the most likely threats, while having the assurance that Defender continues to warn and educate users on other emails that contain suspicious elements.   The bottom line: Being able to prioritize emails makes Defender a much more intelligent tool that is effective at improving workflows and saving our customers time, by drawing their attention to where it is most needed.  
Removing false positives

We want to make sure that all warnings Tessian Defender shows employees are relevant and help prevent real attacks.

False positives occur when Defender warns on a safe email. If this happens too often, warnings could become a distraction, which could have a big impact on productivity for both security admins and email users. Beyond a certain point, a high false positive rate could mean that warnings lose their effectiveness altogether, as users may start to ignore them completely. Being aware of these risks, we take extra care to minimize the number of false positives flagged by Defender.

Similarly to quarantine, a good confidence score can be used to filter out false positives without impacting the number of malicious emails detected. For example, emails with a confidence score below a given threshold could be dropped to avoid showing employees unnecessary warnings.
What’s next?

Overall, you can see there were plenty of important use cases for improving Tessian Defender’s confidence score. The next thing we had to do was look at how we could measure any improvements to the score. You can find a link to part two in the series below.

(Co-authored by Gabriel Goulet-Langlois and Cassie Quek)
Read Blog Post
Engineering Team
A Solution to HTTP 502 Errors with AWS ALB
by Samson Danziger Friday, October 1st, 2021
At Tessian, we have many applications that interact with each other using REST APIs. We noticed in the logs that at random times, uncorrelated with traffic, and seemingly unrelated to any code we had actually written, we were getting a lot of HTTP 502 “Bad Gateway” errors.

Now that the issue is fixed, I wanted to explain what this error means, how you get it, and how to solve it. My hope is that if you’re having to solve this same issue, this article will explain why and what to do.

First, let’s talk about load balancing
In a development system, you usually run one instance of a server and you communicate directly with it. You send HTTP requests to it, it returns responses, everything is golden.

For a production system running at any non-trivial scale, this doesn’t work. Why? Because the amount of traffic going to the server is much greater, and you need it to not fall over even if there are tens of thousands of users.

Typically, servers have a maximum number of connections they can support. Beyond this number, new clients can’t connect, and have to wait until a connection is freed up. In the old days, the solution might have been a bigger machine, with more resources and more available connections.

Now we use a load balancer to manage connections from clients to multiple instances of the server. The load balancer sits in the middle and routes client requests to any available server in the pool that can handle them.

If one server goes down, traffic is automatically routed to one of the others in the pool. If a new server is added, traffic is automatically routed to it too, reducing load on the others.
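The core of the routing scheme is tiny. Here is a hedged sketch of round-robin balancing over a changing pool (real load balancers add health checks, connection draining, weighting, and much more):

```python
class RoundRobinPool:
    """Route each request to the next server in a (possibly changing) pool."""

    def __init__(self, servers):
        self._servers = list(servers)
        self._next = 0

    def route(self):
        server = self._servers[self._next % len(self._servers)]
        self._next += 1
        return server

    def remove(self, server):
        # Called when a health check fails, for example.
        self._servers.remove(server)

    def add(self, server):
        # Called when a new instance comes online.
        self._servers.append(server)

pool = RoundRobinPool(["app-1", "app-2"])
first_four = [pool.route() for _ in range(4)]  # alternates between the two
pool.remove("app-2")                           # app-2 goes down...
after_failure = pool.route()                   # ...and traffic keeps flowing
```

The point of the sketch is the failure path: clients keep calling `route()` and never notice that the pool behind it changed.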
What are 502 errors?

On the web, there are a variety of HTTP status codes that are sent in response to requests to let the user know what happened. Some might be pretty familiar:

- 200 OK – Everything is fine.
- 301 Moved Permanently – I don’t have what you’re looking for, try here instead.
- 403 Forbidden – I understand what you’re looking for, but you’re not allowed here.
- 404 Not Found – I can’t find whatever you’re looking for.
- 503 Service Unavailable – I can’t handle the request right now, probably too busy.

4xx and 5xx both deal with errors. 4xx codes are client errors, where the user has done something wrong. 5xx codes, on the other hand, are server errors, where something is wrong on the server and it’s not your fault.

All of these are specified by a standard called RFC 7231. For 502 it says:

The 502 (Bad Gateway) status code indicates that the server, while acting as a gateway or proxy, received an invalid response from an inbound server it accessed while attempting to fulfill the request.

The load balancer sits in the middle, between the client and the actual service you want to talk to. Usually it acts as a dutiful messenger, passing requests and responses back and forth. But if the service returns an invalid or malformed response, instead of passing that nonsensical information to the client, it sends back a 502 error instead. This lets the client know that the response the load balancer received was invalid.
The actual issue

Adam Crowder has done a full analysis of this problem, tracking it all the way down to TCP packet captures to assess what’s going wrong. That’s a bit out of scope for this post, but here’s a brief summary of what’s happening:

- At Tessian, we have lots of interconnected services. Some of them have Application Load Balancers (ALBs) managing the connections to them.
- In order to make an HTTP request, we must open a TCP socket from the client to the server. Opening a socket involves performing a three-way handshake with the server before either side can send any data.
- Once we’ve finished sending data, the socket is closed with a four-step process. These three- and four-step processes can be a large overhead when not much actual data is sent.
- Instead of opening and then closing one socket per HTTP request, we can keep a socket open for longer and reuse it for multiple HTTP requests. This is called HTTP Keep-Alive. Either the client or the server can then initiate a close of the socket with a FIN segment (either because it is finished with the connection or because of a timeout).
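HTTP Keep-Alive is easy to see in action. The sketch below (standard library only; nothing Tessian-specific) starts a throwaway local HTTP/1.1 server and sends two requests over a single `http.client.HTTPConnection`, i.e. one TCP socket, instead of paying the handshake cost twice:

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # HTTP/1.1 keeps connections alive by default

    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        # Content-Length tells the client where the body ends, so the
        # socket can stay open for the next request.
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet

server = HTTPServer(("127.0.0.1", 0), Handler)  # port 0: pick any free port
threading.Thread(target=server.serve_forever, daemon=True).start()

# One HTTPConnection == one TCP socket, reused across both requests below.
conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
statuses = []
for _ in range(2):
    conn.request("GET", "/")
    response = conn.getresponse()
    response.read()  # drain the body before reusing the socket
    statuses.append(response.status)
conn.close()
server.shutdown()
```

The 502 race described next arises exactly because one side of such a kept-alive socket can decide to close it while the other side is about to reuse it.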
The 502 Bad Gateway error is caused when the ALB sends a request to a service at the same time that the service closes the connection by sending the FIN segment to the ALB socket. The ALB socket receives FIN, acknowledges, and starts a new handshake procedure.   Meanwhile, the socket on the service side has just received a data request referencing the previous (now closed) connection. Because it can’t handle it, it sends an RST segment back to the ALB, and then the ALB returns a 502 to the user.   The diagram and table below show what happens between sockets of the ALB and the Server.
How to fix 502 errors

It’s fairly simple. Just make sure that the service doesn’t send the FIN segment before the ALB sends a FIN segment to the service. In other words, make sure the service doesn’t close the HTTP Keep-Alive connection before the ALB does.

The default idle timeout for the AWS Application Load Balancer is 60 seconds, so we changed the service timeouts to 65 seconds. Barring two hiccups shortly after deploying, this has totally fixed it.

The actual configuration change

I have included the configuration for common Python and Node server frameworks below. If you are using any of these, you can just copy and paste. If not, they should at least point you in the right direction.

uWSGI (Python)

As a config file:

```ini
# app.ini
[uwsgi]
...
harakiri = 65
add-header = Connection: Keep-Alive
http-keepalive = 1
...
```

Or as command line arguments:

```
--add-header "Connection: Keep-Alive" --http-keepalive --harakiri 65
```

Gunicorn (Python)

As command line arguments:

```
--keep-alive 65
```

Express (Node)

In Express, specify the time in milliseconds on the server object:

```js
const express = require('express');
const app = express();
const server = app.listen(80);
server.keepAliveTimeout = 65000;
```
Looking for more tips from engineers and other cybersecurity news? Keep up with our blog and follow us on LinkedIn.
Read Blog Post
Engineering Team
Tessian’s CSI QA Journey: WinAppDriver, Office Apps, and Sessions
by Tessian Wednesday, June 30th, 2021
Introduction

In part one, we went over the decisions that led the CSI team to start automating its UI application testing, with a focus on the process drivers and journey. Today we’re going to start going over the technical challenges, solutions, and learnings along the way. It would be good if you had a bit of understanding of how to use WinAppDriver for UI testing. As there are a multitude of beginner tutorials, this post will be more in depth. All code samples are available as a complete solution here.

How We Got Here

As I’m sure many others have done before, we started by adapting WinAppDriver samples into our own code base. After we had about 20 tests up and running, it became clear that taking some time to better architect common operations would help in fixing tests as we targeted more versions of Outlook, Windows, etc. Simple things like how long to wait for a window to open, or how long to wait to receive an email, can be impacted by the test environment, and it quickly becomes tedious to change these in 20 different places whenever we have a new understanding of, or solution for, the best way to do these operations.

Application Sessions

A good place to start when writing UI tests is just getting the tests to open the application. There are plenty of samples online that show you how to do this, but there are a few things that the samples leave each of us to solve on our own that I think would be helpful to share with the larger Internet community.

All Application Sessions are Pretty Similar

And when code keeps repeating itself, it’s time to abstract it into interfaces and classes. So, we have both an interface and a base class:
Don’t worry, we’ll get into the bits. The main point of this class is that it pertains to starting/stopping, or attaching/detaching to, applications, and that we’re storing enough information about the application under test to do those operations. In the constructor, the name of the process is used to determine if we can attach to an already running process, whereas the path to the executable is used if we don’t find a running process and need to start a fresh instance. The process name can be found in the Task Manager’s Details tab.

Your Tests Should Run WinAppDriver

I can’t tell you how many times I’ve clicked run on my tests only to have them all fail because I forgot to start the WinAppDriver process beforehand. WinAppDriver is the application that drives the mouse and keyboard clicks, along with getting element IDs, names, classes, etc. of the application under test. Using the same solution WinAppDriver’s examples show for starting any application, you can start the WinAppDriver process as well. Using IManageSession and BaseSession<T> above, we get:
The default constructor just calls BaseSession<WinAppDriverProcess> with the name of the process and the path to the executable. So you can see that StartSession here is implemented to be thread safe.  This ensures that only one instance can be created in a test session, and that it’s created safely in an environment where you run your tests across multiple threads.  It then queries the base class about whether the application you’re starting is already running or not.  If it is running, we attach to it.  If it’s not, we start a new instance and attach to that.  Here are those methods:
These are both named Unsafe to show that they’re not thread safe, and it’s up to the calling method to ensure thread safety.  In this case, that’s StartSession(). And for completeness, StopSession does something very similar except it queries BaseSession<T> to see if we own the process (i.e. it was started as a fresh instance and not attached to), or not.  If we own it, then we’re responsible for shutting it down, but if we only attach to it, then leave it open.
You’ll Probably Want a DesktopSession Desktop sessions can be useful ways to test elements from the root of the Windows Desktop.  This would include things like the Start Menu, sys-tray, or file explorer windows.  We use it for our sys-tray icon functionality, but regardless of what you need it for, WinAppDriver’s FAQ provides the details, but I’ve made it work here using IManageSession and BaseSession<T>:
It’s a lot simpler since we’d never be required to start the root session.  It’s still helpful to have it inherit from BaseSession<T> as that will provide us some base functionality like storing the instance in a Singleton and knowing how long to wait for windows to appear when switching to/from them. Sessions for Applications with Splash Screens This includes all the Office applications.  WinAppDriver’s FAQ has some help on this, but I think I’ve improved it a bit with the do/while loop to wait for the main window to appear.  The other methods look similar to the above, so I’ve collapsed them for brevity.
Putting it All Together So how do we put all this together and make a test run?  Glad you asked! NUnit I make fairly heavy use of NUnit’s class and method level attributes to ensure things get set up correctly depending on the assembly, namespace, or class a test is run in.  Mainly, I have a OneTimeSetup for the whole assembly that starts WinAppDriver and attaches to the Desktop root session.  
Then I separate my tests into namespaces that correspond to the application under test – in this case, it’s Outlook.  
I then use a OneTimeSetup in that namespace that starts Outlook (or attaches to it). 
Finally, I use SetUp and TearDown attributes on the test classes to ensure I start and end each test from the main application window.
The Test All that allows you to write (the somewhat verbose) test:
Wrapping It All Up

For this post we went into the details of how to organize and code your Sessions for UI testing. We showed you how to design them so you can reuse code between different application sessions. We also enabled them to either start the application or connect to an already running application instance (and showed how the Session object can determine which to do itself). Finally, we put it all together and created a basic test that drives Outlook’s UI to compose a new email message and send it. Stay tuned for the next post, where we’ll delve into how to handle all the dialog windows your UI needs to interact with, and abstract that away, so you can write a full test with something that looks like this:
Read Blog Post
Engineering Team, Life at Tessian
React Hooks at Tessian
by Luke Barnard Wednesday, June 16th, 2021
I’d like to describe Tessian’s journey with React hooks so far, covering some technical aspects as we go.

About two years ago, some of the Frontend guild at Tessian were getting very excited about a new React feature that was being made available in an upcoming version: React Hooks. React Hooks are a very powerful way to encapsulate state within a React app. In the words of the original blog post, they make it possible to share stateful logic between multiple components. Much like React components, they can be composed to create more powerful hooks that combine multiple different stateful aspects of an application together in one place.

So why were we so excited about the possibilities that these hooks could bring? The answer could be found in the way we were writing features before hooks came along. Every time we wrote a feature, we would have to write extra “boilerplate” code using what was, at some point, considered by the React community to be the de facto method for managing state within a React app: Redux. As well as Redux, we depended on Redux Sagas, a popular library for implementing asynchronous functionality within the confines of Redux. Combined, these two(!) libraries gave us the foundation upon which to do… very simple things, mostly API requests: handling responses, and tracking loading and error states for each API that our app interacted with.

The overhead of working in this way was plain: each feature required a new set of sagas, reducers, actions and of course the UI itself, not to mention the tests for each of these. This would often come up as a talking point when deciding how long a certain task would take during a sprint planning session.

Of course there were some benefits in being able to isolate each aspect of every feature. Redux and Redux Sagas are both well known for being easy to test, making testing of state changes and asynchronous API interactions very straightforward and very (if not entirely) predictable.
But there are other ways to keep testing important parts of code, even when hooks get involved (more on that another time). Also, I think it’s important to note that there are ways of using Redux Sagas without maintaining a lot of boilerplate, e.g. by using a generic saga, reducer and set of actions to handle all API requests. This would still require certain components to be connected to the Redux store, which is not impossible but might encourage prop-drilling.

In the end, everyone agreed that the pattern we were using didn’t suit our needs, so we decided to introduce hooks to the app, specifically for new feature development. We also agreed that changing everything all at once, in a field where paradigms fall into and out of fashion rather quickly, was a bad idea. So we settled on a compromise where we would gradually introduce small pieces of functionality to test the waters. I’d like to introduce some examples of hooks that we use at Tessian to illustrate our journey with them.

Tessian’s first hook: usePortal

Our first hook was usePortal. The idea behind the hook was to take any component and insert it into a React Portal. This is particularly useful where the UI is shown “above” everything else on the page, such as dialog boxes and modals. The documentation for React Portals recommends using a React class component, using the lifecycle methods to instantiate and tear down the portal as the component mounts/unmounts. Knowing we could achieve the same thing with hooks, we wrote a hook that would handle this functionality and encapsulate it, ready to be reused by our myriad of modals, dialog boxes and popouts across the Tessian portal. The gist of the hook is something like this:
Note that the hook returns a function that can be treated as a React component. This pattern is reminiscent of React HOCs, which are typically used to share concerns across multiple components. Hooks enable something similar but instead of creating a new class of component, usePortal can be used by any (function) component. This added flexibility gives hooks an advantage over HOCs in these sorts of situations. Anyway, the hook itself is very simple in nature, but what it enables is awesome! Here’s an example of how usePortal can be used to give a modal component its own portal:
Just look at how clean that is! One line of code for an infinite amount of behind-the-scenes complexity, including side-effects and asynchronous behaviors! It would be an understatement to say that at this point, the entire team was hooked on hooks!

Tessian’s hooks, two months later

Two months later we wrote hooks for interacting with our APIs. We were already using Axios as our HTTP request library and we had a good idea of our requirements for pretty much any API interaction. We wanted:

- To be able to specify anything accepted by the Axios library
- To be able to access the latest data returned from the API
- To have an indication of whether an error had occurred and whether a request was ongoing

Our real useFetch hook has since become a bit more complicated, but to begin with it looked something like this:
To compare this to the amount of code we would have to write for Redux sagas, reducers and actions – there’s no comparison. This hook clearly encapsulated a key piece of functionality that we have since gone on to use dozens of times in dozens of new features. From here on out, hooks were here to stay in the Tessian portal, and we decided to phase out Redux for use in features. Today there are 72 places where we’ve used this hook or its derivatives; that’s 72 times we haven’t had to write any sagas, reducers or actions to manage API requests!

Tessian’s hooks in 2021

I’d like to conclude with one of our more recent additions to our growing family of hooks. Created by our resident “hook hacker”, João, this hook encapsulates a very common UX paradigm seen in basically every app. It’s called useSave. The experience is as follows:

- The user is presented with a form or a set of controls that can be used to alter the state of some object or document in the system.
- When a change is made, the object is considered “edited” and must be “saved” by the user in order for the changes to persist and take effect. Changes can also be “discarded” such that the form returns to the initial state.
- The user should be prompted when navigating away from the page or closing the page to prevent them from losing any unsaved changes.
- When the changes are in the process of being saved, the controls should be disabled and there should be some indication to let the user know that: (a) the changes are being saved, (b) the changes have been saved successfully, or (c) there was an error with their submission.
Each of these aspects requires the use of a few different native hooks:

- A hook to track the object data with the user’s changes (useState)
- A hook to save the object data on the server and expose the current object data (useFetch)
- A hook to update the tracked object data when a save is successful (useEffect)
- A hook to prevent the window from closing/navigating if changes haven’t been saved yet (useEffect)

Here’s a simplified version:
As you can see, the code is fairly concise and more importantly it makes no mention of any UI component. This separation means we can use this hook in any part of our app using any of our existing UI components (whether old or new). An exercise for the reader: see if you can change the hook above so that it exposes a textual label to indicate the current state of the saved object. For example if isLoading is true, maybe the label could indicate “Saving changes…” or if hasChanges is true, the text could read “Click ‘Save’ to save changes”. Tessian is hiring! Thanks for following me on this wild hook-based journey, I hope you found it enlightening or inspiring in some way. If you’re interested in working with other engineers that are super motivated to write code that can empower others to implement awesome features, you’re in luck! Tessian is hiring for a range of different roles, so connect with me on LinkedIn, and I can refer you!
Read Blog Post
Engineering Team
After 18 Months of Engineering OKRs, What Have We Learned?
by Andy Smith Thursday, June 3rd, 2021
We have been using OKRs (Objectives and Key Results) at Tessian for over 18 months now, including in the Engineering team. They’ve grown into an essential part of the organizational fabric of the department, but it wasn’t always this way. In this article I will share a few of the challenges we’ve faced, lessons we’ve learned and some of the solutions that have worked for us. I won’t try and sell you on OKRs or explain what an OKR is or how they work exactly; there’s lots of great content that already does this!

Getting started

When we introduced OKRs, there were about 30 people in the Engineering department. The complexity of the team was just reaching the tipping point where planning becomes necessary to operate effectively. We had never really needed to plan before, so we found OKR setting quite challenging, and we found ourselves taking a long time to set what turned out to be bad OKRs. It was tempting to think that this pain was caused by OKRs themselves. On reflection today, however, it’s clear that OKRs were merely surfacing an existing pain that would have emerged at some point anyway. If teams can’t agree on an OKR, they’re probably not aligned about what they are working on. OKRs surfaced this misalignment, causing a little pain during the setting process that prevented a larger pain during the quarter, when the misalignment would have had a bigger impact.

The Key Result part of an OKR is supposed to describe the intended outcome in a specific and measurable way. This is sometimes straightforward, typically when a very clear metric is used, such as revenue or latency or uptime. However, in Engineering there are often KRs that are very hard to write well. It’s too easy to end up with a bunch of KRs that drive us to ship a bunch of features on time, but have no aspect of quality or impact. The other pitfall is aiming for a very measurable outcome that is based on a guess, which is what happens when there is no baseline to work from.
Again, these challenges exist without OKRs, but without OKRs to force the issue, the conversation about what a good outcome is for a particular deliverable may never happen. Unfortunately we haven’t found the magic wand that makes this easy, and we still have some binary “deliver the feature” key results every quarter, but these are less frequent now. We will often set a KR to ship a feature in Q1 and to set up a metric, and will then set a target for the metric in Q2 once we have a baseline. Or, if we have a lot of delivery KRs, we’ll pull them out of OKRs altogether and zoom out to set the KR around their overall impact.

An eternal debate in the OKR world is whether to set OKRs top-down (leadership dictates the OKRs and teams/individuals fill out the details), bottom-up (leadership aggregates the OKRs of teams and individuals into something coherent), or some mixture of the two. We use a blend, drafting department OKRs as a leadership team and then iterating a lot with teams, sometimes changing them entirely. This takes time, though. Every iteration uncovers misalignment, capacity, stakeholder or research issues that need to be addressed. We’ve sometimes been frustrated and rushed this through as it felt like a waste of time, but when we’ve done so, we’ve just ended up with bigger problems later down the road that are harder to solve than setting decent OKRs in the first place. The lesson we’ve learned is that effort, engagement with teams and old-fashioned rigor are required when setting OKRs, so we budget 3-4 weeks for the whole process.

Putting OKRs into Practice

The last three points have all been about setting OKRs, but what about actually using them day to day? We’ve learned two things: the importance of allowing a little flex, and how a frequent but light process is needed to get the most out of your OKRs.

First, flex. Our OKRs are quarterly, but sometimes we need to set a 6-month OKR because it just makes more sense!
We encourage this to happen. We don’t obsess about making OKRs ladder up perfectly to higher-level OKRs. It’s nice when they do, but if this is a strict requirement, then we find that it’s hard to make OKRs that actually reflect the priorities of the quarter. Sometimes a month into the quarter, we realize we set a bad OKR or wrote it in the wrong way. A bit of flexibility here is important, but not too much. It’s important to learn from planning failures, but it is probably more important that OKRs reflect teams’ actual priorities and goals, or nobody is going to take them seriously. So tweak that metric or cancel that OKR if you really need to, but don’t go wild. Finally, process. If we don’t actively check in on OKRs weekly, we tend to find that all the value we get from OKRs is diluted. Course-corrections come too late or worries go unsolved for too long. To keep this sustainable, we do this very quickly. I have an OKR check-in on the agenda for all my 1-1s with direct reports, and we run a 15-minute group meeting every week with the Product team where each OKR owner flags any OKRs that are off track, and we work out what we need to do to resolve them. Often this causes us to open a slack channel or draft a document to solve the issue outside of the meeting so that we stick to the strict 15 minute time slot. Many of these lessons have come from suggestions from the team, so my final tip is that if you’re embarking on using OKRs in your Engineering team, or if you need to get them back on track, make sure you set some time aside to run a retrospective. This invites your leaders and managers to think about the mechanics of OKRs and planning, and they usually have the best ideas on how to improve things.
Engineering Team
Tessian’s Client Side Integrations QA Journey – Part I
by Craig Callender Thursday, May 20th, 2021
In this series, we’re going to go over the Quality Assurance journey we’ve been on here in the Client Side Integrations (CSI) team at Tessian. Most of this post will draw on our experience with the Outlook Add-in, as that’s the piece of software most used by our clients. But the philosophies and learnings here apply to most software in general (regardless of where it’s run), with an emphasis on software that includes a UI.

I’ll admit that the impetus for this work was me sitting in my home office one Saturday morning, knowing that I’d have to start manual testing for an upcoming release in the next two weeks, and just not being able to convince myself to click the “Send” button in Outlook after typing a subject of “Hello world” one more time… But once you start automating UI tests, it just builds on itself and you start being able to construct new tests from existing code. It can be such a euphoric experience. If you’ve ever dreaded (or are currently dreading) running through manual QA tests, keep reading and see if you can implement some of the solutions we have.

Why QA in the Outlook Add-in Ecosystem is Hard

The Outlook Add-in was the first piece of software we wrote to run on our clients’ computers and, as a result, alongside needing to work in some of the oldest software Microsoft develops (Outlook), there are challenges when it comes to QA.
These challenges include:

- Detecting faults in the add-in itself
- Detecting changes in Outlook which may result in functionality loss for our add-in
- Detecting changes in Windows that may result in performance issues for our add-in
- Testing the myriad of environments our add-in will be installed in

The last point is the hardest to QA. Even a list of just a subset of the different configurations of Outlook shows that the permutations of test environments just don’t scale well:

- Outlook’s Online vs Cached mode
- Outlook edition: 2010, 2013, 2016, 2019 perpetual license, 2019 volume license, M365 with its 5 update channels…
- Connected to on-premise Exchange vs Exchange Online/M365
- Other add-ins in Outlook
- Third-party Exchange add-ins (retention software, auditing, archiving, etc.)

And now add the non-Outlook-related environment issues we’ve had to work through:

- Network proxies, VPNs, endpoint protection
- Virus scanning software
- Windows versions

One can see how it would be impossible to predict all the environmental configurations and validate our add-in’s functionality before releasing it.

A Brief History of QA in Client Side Integrations (CSI)

As many companies do, we started our QA journey with a QA team: a set of individuals whose full-time job was to install the latest commit of our add-in and test its functionality. This quickly grew to the point where the team was managing and sharing VMs to cover all the permutations above. They also worked hard to emulate the external factors of our clients’ environments, like proxy servers, weak Internet connections, etc. This model works well for exploratory testing and finding strange edge cases. Where it doesn’t work or scale well is communication (the person seeing the bug isn’t the person fixing the bug) and automation (every release takes more and more person-power as the list of regression issues gets longer and longer).
In 2020 Andy Smith, our Head of Engineering, made a commitment that all QA at Tessian would be performed by Developers. This had a large impact on the CSI team, as we test an external application (Outlook) across many different versions and configurations which can affect its UI. So CSI set out a three-phase approach for the Development team to absorb the QA processes. (Watch how good we are at naming things in my team.)

Short-Term

The basic goal here was that the Developers would run through the same steps and processes that were already defined for our QA. This meant a lot of manual testing, manually configuring environments, etc. The biggest learning for our team during this phase was that there needs to be a Developer on an overview board whenever you have a QA department, to ensure that test steps actually test the thing you want. We found many instances where an assumption in a test step was incorrect or didn’t fully test something.

Medium-Term

The idea here was that once the Developers were familiar and comfortable running through the processes defined by the QA department, we would take over ownership of the actual tests themselves and make edits. Often these edits resulted in the ability to test a piece of functionality with more accuracy or fewer steps. It also included the ability to stand up an environment that tests more permutations, etc. Ownership of the actual tests also meant that as we changed the steps, we needed to do so with automation in mind.

Long-Term

Automation. Whether it’s unit, integration, or UI tests, we need them automated. Let a computer run the same test over and over again, and let the Developers think up ever-increasing complexity in what and how we test.
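To make that long-term goal concrete, here is a minimal sketch of the kind of automated, cause-agnostic check we aim for. Everything in it (the function names, the saving logic) is illustrative, not our actual add-in code:

```python
# Illustrative only: a regression test that asserts on observable behaviour
# (a saved attachment exists and is non-empty) rather than on any single
# root cause of a past failure.
import tempfile
from pathlib import Path


def save_attachment(data: bytes, directory: str, name: str) -> Path:
    """Stand-in for the add-in code under test."""
    path = Path(directory) / name
    path.write_bytes(data)
    return path


def test_saved_attachment_is_intact():
    with tempfile.TemporaryDirectory() as tmp:
        saved = save_attachment(b"%PDF-1.7 ...", tmp, "report.pdf")
        assert saved.exists()
        # Catches zero-byte stubbing, virus-scanner interference, Outlook API
        # changes -- any cause that loses the attachment's content.
        assert saved.stat().st_size > 0
```

A test like this runs identically in CI and on a developer’s machine, which is what lets regression coverage grow without growing the manual checklist.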
Our QA Philosophy

Because it would be impossible for us to test every permutation of potential clients’ environments (or even an existing client’s environment) before we release our software, we approach our QA with the following philosophies:

Software Engineers are in the Best Position to Perform QA

This means that the people responsible for developing a feature or bug fix are the best people to write the test cases needed to validate the change, to add those test cases to a release cycle, and even to run the tests themselves. The whys of this could be (and probably will be) a whole post. 🙂

Bugs Will Happen

We’re human. We’ll miss something we shouldn’t have. We won’t think of something we should have. On top of that, we’ll see something we didn’t even think was a possibility. So be kind, and focus on the solution rather than the bad code commit.

More Confidence, Quicker

Our QA processes are here to give us more confidence in our software as quickly as possible, so we can release features or fixes to our clients. Whether we’re adding, editing, or removing a step in our QA process, we ask ourselves whether doing so will bring more confidence to our release cycle or speed it up. Sometimes we have to make trade-offs between the two.

Never Release the Same Bug Twice

Our QA process should be about preventing regressions on past issues just as much as confirming the functionality of new features. We want a robust enough process that when an issue is encountered and solved once, that same issue is never found again. At the least, this means we’d never have the same bug with the same root cause again. At most, it means we never see the same type of bug again, since a root cause can differ even though the loss in functionality is the same.
An example of this last point: if our team works through an issue where a virus scanner is preventing us from saving an attachment to disk, we should have a robust enough test that it will also detect the same loss in functionality (the inability to save an attachment to disk) from any cause — for example, a change to how Outlook allows access to the attachment, or another add-in stubbing the attachment to zero bytes for archiving purposes.

How Did We Do?

We started this journey with a handful of unit tests that were all automated in our CI environment.

Short-Term Phase

During the Short-Term phase, there was an emphasis on ensuring that new commits came with unit tests. Did we sometimes decide to release a feature with only manual tests because the code base didn’t lend itself to unit testability? YES! But we worked hard to always ensure we had a good reason for excluding unit tests, instead of just assuming it couldn’t be done because it hadn’t been before. Being flexible while keeping your long-term goal in mind is key, and at times, challenging.

Medium-Term

This phase involved much less test re-writing than we had intentionally set out to do. We added a section to our pull requests for links to any manual testing steps required to test the new code. This resulted in more new manual tests being written by developers than edits to existing ones. We did notice that the quality of the tests changed. It’s tempting to say “for the better” or “with better efficiency”, but I believe most of the change can be attributed to the tests now being written for a different audience, namely Developers. They became a bit more abstract and a bit more technical. Less intuitive. They also became a bit more verbose, as we get a bad taste in our mouths whenever we see a manual step that says something like “Trigger an EnforcerFilter” with no description of which one?
One that displays something to the user, or just to the admin? Etc.

This phase was also much shorter than we had originally thought it would be.

Long-Term

This was my favorite phase. I’m actually one of those software engineers that LOVES writing unit tests. I attribute this to JetBrains’ ReSharper (I could write about my love of ReSharper all day), whose interface gives me oh-so-satisfying green checkmarks as my tests run… I love seeing more and more green checkmarks!

We had agreed a long-term OKR with Andy, which gave us three quarters in 2021 to implement automation for three of our major modules (Constructor, Enforcer, and Guardian), with a stretch goal of getting one test working for our fourth major module, Defender. We blew this out of the water and met them all (including a beta module, Architect) in one quarter. It was addictive writing UI tests and watching the keyboard and mouse move on their own.

Wrapping it All Up

Like many software product companies large and small, Tessian started out with a manual QA department composed of technologists, but not Software Engineers. Along the way, we made the decision that Software Engineers need to own the QA of the software they work on. This led us on a journey which included closer reviews of existing tests, writing new tests, and finally automating the majority of our tests. All of this combined allows us to release our software with more confidence, more quickly.

Stay tuned for articles where we go into the details of the actual automation of UI tests and get our hands dirty with some fun code.
Engineering Team
How Do You Encrypt PySpark Exceptions?
by Vladimir Mazlov Friday, May 14th, 2021
We at Tessian are very passionate about the safety of our customers. We constantly handle sensitive email data to improve the quality of our protection against misdirected emails, exfiltration attempts, spear phishing, etc. This means that many of our applications handle data that we can’t afford to have leaked or compromised.

As part of our efforts to keep customer data safe, we take care to encrypt any exceptions we log, as you never know when a variable that has the wrong type happens to contain an email address. This approach allows us to be liberal with the access we give to the logs, while giving us comfort that customer data won’t end up in them. Spark applications are no exception to this rule; however, implementing encryption for them turned out to be quite a journey.

So let us be your guide on this journey. This is a tale of despair, of betrayal and of triumph. It is a tale of PySpark exception encryption.
Problem statement

Before we enter the gates of darkness, we need to share some details about our system so that you know where we’re coming from.

The language of choice for our backend applications is Python 3. To achieve exception encryption we hook into Python’s error handling system and modify the traceback before logging it. This happens inside a function called init_logging() and looks roughly like this:
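The original snippet isn’t reproduced here; the following is a rough, hypothetical sketch of the idea. The base64 stand-in is ours purely for illustration — a real implementation would use proper encryption with a key held only by authorized staff:

```python
import base64
import sys
import traceback


def _encrypt(text: str) -> str:
    # Stand-in for real encryption. In practice this would be, e.g.,
    # public-key encryption so only security staff can decrypt the logs.
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def init_logging():
    """Install an excepthook that encrypts tracebacks before they are logged."""

    def encrypting_excepthook(exc_type, exc_value, exc_tb):
        plain = "".join(traceback.format_exception(exc_type, exc_value, exc_tb))
        # Only ciphertext reaches the logs; the plaintext traceback, which may
        # contain customer data, is never written out.
        sys.stderr.write("ENCRYPTED EXCEPTION: " + _encrypt(plain) + "\n")

    sys.excepthook = encrypting_excepthook
```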
We use Spark 2.4.4. Spark jobs are written entirely in Python; consequently, we are concerned with Python exceptions here. If you’ve ever seen a complete set of logs from a YARN-managed PySpark cluster, you know that a single ValueError can get logged tens of times in different forms; our goal will be to make sure all of them are either not present or encrypted.

We’ll be using the following error to simulate an exception raised by a Python function handling sensitive customer information:
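The error from the source post isn’t shown here; a hypothetical stand-in that captures the important property (the exception message contains customer data) could be:

```python
# Hypothetical stand-in for application code: the ValueError's message leaks
# an email address, which is exactly what must never hit the logs unencrypted.
def handle_customer_email(record: dict) -> None:
    raise ValueError(f"could not process email for {record['recipient']}")
```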
Looking at this, we can separate the problem into two parts: the driver and the executors.

The executors

Let’s start with what we initially (correctly) perceived to be the main issue. Spark executors are a fairly straightforward concept until you add Python into the mix. The specifics of what’s going on inside are not often talked about but are relevant to the discussion at hand, so let’s dive in.
All executors are actually JVMs, not Python interpreters, and are implemented in Scala. Upon receiving Python code that needs to be executed (e.g. in rdd.map), they start a daemon, written in Python, that is responsible for forking the worker processes and supplying them with a means of talking to the JVM, via sockets.

The protocol here is pretty convoluted and very low-level, so we won’t go into too much depth. Two details will be relevant to us; both have to do with communication between the Python side and the executor JVM:

- The JVM executor expects the daemon to open a listening socket on the loopback interface and communicate the port back to it via stdout
- The worker code contains a general try-except that catches any errors from the application code and writes the traceback to the socket that’s read by the JVM

Point 2 is how the Python exceptions actually get to the executor logs, which is exactly why we can’t just use init_logging(), even if we could guarantee that it was called: Python tracebacks are actually logged by Scala code!

How is this information useful? Well, you might notice that the daemon controls all Python execution, as it spawns the workers. If we can make it spawn a worker that will encrypt exceptions, our problems are solved. And it turns out Spark has an option that does just that: spark.python.daemon.module. This solution actually works; the problem is it’s incredibly fragile:

- We now have to copy the code of the daemon, which makes Spark version updates difficult
- Remember, the daemon communicates the port to the JVM via stdout. Anything else written to stdout (say, a warning output by one of the packages used for encryption) destroys the executor.
As you can probably tell by the level of detail here, we really did think we could do the encryption at this level. Disappointed, we went one level up and took a look at how the PythonException was handled in the Scala code.

Turns out it’s just logged at ERROR level, with the Python traceback received from the worker treated as the message. Spark uses log4j, which provides a number of options to extend it; Spark, additionally, provides the option to override log processing using its configuration.

Thus, we will have achieved our goal if we encrypt the messages of all exceptions at the log4j level. We did this by creating a custom RealEncryptExceptionLayout class. Broadly, it simply delegates to the default layout, except when the log event carries an exception, in which case it substitutes one with an encrypted message.
To make this work we shipped this as a jar to the cluster and, importantly, specified the following configuration:
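The exact configuration isn’t reproduced here, but broadly it takes the following shape. All file, property, and class names below are placeholders of ours, not the real ones:

```shell
# Sketch only -- names are placeholders. The jar with the custom layout goes
# on the executor classpath, and the executors' log4j is pointed at a
# properties file that routes log output through that layout, e.g. a line like:
#   log4j.appender.console.layout=com.example.RealEncryptExceptionLayout
spark-submit \
  --conf "spark.executor.extraClassPath=encrypting-layout.jar" \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j-encrypting.properties" \
  our_spark_job.py
```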
And voila!

The driver: executor errors by way of Py4J

Satisfied with ourselves, we decided to grep the logs for the error before moving on to errors in the driver. Said moving on was not yet to be, however, as we found the following in the driver’s stdout:
This is not only incredibly readable but also not encrypted! This exception, as you can very easily tell, is thrown by the Scala code, specifically DAGScheduler, when a task set fails, in this case due to repeated task failures.

Fortunately for us, as illustrated by the diagram above, the driver simply runs Python code in the interpreter; that code, as far as it’s concerned, just happens to call py4j APIs that, in turn, communicate with the JVM. Thus, it’s not meaningfully different from our backend applications in terms of error handling, so we can simply reuse the init_logging() function. If we do, and check the stdout, we see that it does indeed work:
The driver: executor errors by way of TaskSetManager

Yes, believe it or not, we haven’t escaped the shadow of the executor just yet. We’ve seen our fair share of the driver’s stdout. But what about stderr? Wouldn’t any reasonable person expect to see some of those juicy errors there as well? We pride ourselves on being occasionally reasonable, so we did check. And lo and behold:
Turns out there is yet another component that reports errors from the executors: TaskSetManager; our good friend DAGScheduler also logs this error when a stage crashes because of it. Both of them, however, do this while processing events initially originating in the executors; where does the traceback really come from? In a rare flash of logic in our dark journey, from the Executor class, specifically the run method:
Aha, there’s a Serializer here! That’s very promising, we should be able to extend/replace it to encrypt the exception before actual serialization, right? Wrong. In fact, to our dismay, that used to be possible but was removed in version 2.0.0 (reference: https://issues.apache.org/jira/browse/SPARK-12414).   Seeing as how nothing is configurable at this end, let’s go back to the TaskSetManager and DAGScheduler and note that the offending tracebacks are logged by both of them. Since we are already manipulating the logging mechanism, why not go further down that lane and encrypt these logs as well?   Sure, that’s a possible solution. However, both log lines, as you can see in the snippet, are INFO. To find out that this particular log line contains a Python traceback from an executor we’d have to modify the Layout to parse it. Instead of doing that and risking writing a bad regex (a distinct possibility as some believe a good regex is an animal about as real as a unicorn) we decided to go for a simple and elegant solution. We simply don’t ship the .jar containing the Layout to the driver; like we said, elegant. That turns out to have the following effect:
And that’s all that we find in the stderr! Which suits us just fine, as any errors from the driver will be wrapped in Py4J, diligently reported in the stdout and, as we’ve established, encrypted.

The driver: Python errors

That takes care of the executor errors in the driver. But the driver is nothing to sniff at either. It can fail and log exceptions just as well, can’t it? As you have probably already guessed, this isn’t really a problem. After all, the driver is just running Python code, and we’re already calling init_logging(). Satisfyingly enough, it turns out to work as one would expect.

For these errors we again need to look at the driver’s stdout. If we raise the exception in the code executed in the driver (i.e. the main function), the stdout normally contains:
Calling init_logging() turns this traceback into:
Conclusion

And thus our journey comes to an end. Ultimately our struggle led us to two realizations; neither is particularly groundbreaking, but both are important to understand when dealing with PySpark:

- Spark is not afraid to repeat itself in the logs, especially when it comes to errors.
- PySpark specifically is written in such a way that the driver and the executors are very different.

Before we say our goodbyes, we feel we must address one question: WHY? Why go through with this and not just abandon this complicated project?

The data our Spark jobs handle is very sensitive; in most cases it consists of various properties of emails sent or received by our customers. If we give up on encrypting the exceptions, we must accept that this very sensitive information could end up in a traceback, at which point it will be propagated by Spark to various log files. The only real way to guarantee no personal data is leaked in that case is to forbid access to the logs altogether. And while we did have to descend into the abyss and come back to achieve error encryption, debugging Spark jobs without access to logs is inviting the abyss inside yourself.
Engineering Team
How We Improved Developer Experience in a Rapidly Growing Engineering Team
by Andy Smith Friday, April 16th, 2021
Developer experience is one of the most important things for a Head of Engineering to care about. Is development safe and fast? Are developers proud of their work? Are our processes enabling great collaboration and getting the best out of the team? But sometimes developer experience doesn’t get the attention it deserves. It is never the most urgent problem to solve, there are lots of different opinions about how to make improvements, and it seems very hard to measure.

At Tessian the team grows and evolves very quickly; we’ve gone from 20 developers to over 60 in just 3 years. When the team was smaller, it was straightforward to keep a finger on the pulse of developer experience. With such a large and rapidly growing team, it’s all too easy for developer experience to be overshadowed by other priorities. At the end of 2020, it became clear that we needed a department-wide view of how our developer experience is perceived — one we could use to inform decisions and to see whether those decisions had an impact.

We decided that one thing that would really help is a regular survey. This would help us spot patterns quickly, and it would give us a way to know whether we were improving or getting worse. Most importantly, it gives everyone in the team a chance to have their say and to understand what others are thinking. Borrowing some ideas from Spotify, we sent the survey out in January to the whole Engineering team to get their honest, anonymized feedback. We’ll be repeating this quarterly. Here are some of the high-level topics we covered in the survey.

Speed and ease

To better understand whether our developers feel they can work quickly and securely, we asked the following questions:

- How simple, safe and painless is it to release your work?
- Do you feel that the speed of development is high?
You can see we got a big spread of answers, with quite a few detractors. We looked into this more deeply and identified that the primary driver is that some changes cannot be released independently by developers; some changes have a dependency on other teams, and this can slow down development. We’d heard similar feedback before running the survey, which had led us to start migrating from Amazon ECS to Kubernetes so that our Engineering teams could make more changes themselves. It was great to validate this strategy with results from the survey.

More feedback called out a lack of test automation in an important component of our system. We weren’t taking risks here, but we were using up Engineering time unnecessarily. This led us to commit to a project to bring automation here, which has already helped us find issues 15x quicker than before.
Autonomy and satisfaction

We identified two areas of strength by asking the following questions:

- How proud are you of the work you produce and the impact it has for customers?
- How much do you feel your team has a say in what they build and how they build it?
These are two areas that we’ve always worked very hard on because they are so important to us at Tessian. In fact, customer impact and having a say in what is built are the top two reasons that engineers decide to join Tessian.  We’ve recently introduced a Slack channel called #securingthehumanlayer, where our Sales and Customer Success teams share quotes and stories from customers and prospects who have been wowed by their Tessian experience or who have avoided major data breaches (or embarrassing ‘Oh sh*t’ moments!).  We’ve also introduced changes to how OKRs are set, which gives the team much more autonomy over their OKRs and more time to collaborate with other teams when defining OKRs. Recently we launched a new product feature, Misattached File Prevention. Within one hour of enabling this product for our customers, we were able to share an anonymised story of an awesome flag that we’d caught.
What’s next?

We’re running the next survey soon and are excited to see what we learn and how we can make the developer experience at Tessian as great as possible.
Engineering Team, Compliance, Life at Tessian
Securing SOC 2 Certification
by Trevor Luker Tuesday, March 30th, 2021
Building on our existing ISO 27001 security certification, Tessian is excited to announce that we have achieved Service Organization Control 2 Type 2 (SOC 2) compliance in the key domains of Security, Confidentiality and Availability with zero exceptions on our very first attempt. Achieving full SOC 2 Type 2 compliance within 6 months is simply sensational and is a huge achievement for our company. It reinforces our message to customers and prospects that Information Security and protecting customer data is at the very core of everything Tessian does.
The Journey

We began the preparations for SOC 2 in September 2020 and initiated the formal process in October. Having previously experienced the pain and trauma of doing SOC 2 manually, we knew that to move quickly we needed tooling to assist with evidence gathering and reporting. Fortunately we were introduced to VANTA, which automates the majority of the information-gathering tasks, allowing the Tessian team to concentrate on identifying and closing any gaps we had. VANTA is a great platform, and we would recommend it to any other company undertaking SOC 2 or ISO 27001 certification.

For the external audit part of the process, we were especially fortunate to team up with Barr Advisory, who proactively helped us navigate the maze of the Trust Services Criteria requirements. They provided skilled, objective advice and guidance along the way, and we would particularly like to thank Cody Hewell and Kyle Helles for their insights, enthusiasm and support.

Tessian chose an accelerated three-month observation period, which in turn put a lot of pressure on internal resources to respond to information requests and deliver process changes as required. The Tessian team knew how important SOC 2 was to us strategically and rallied to the challenge. Despite some extremely short timeframes, we were able to deliver the evidence that the auditors needed. A huge team effort and a great reflection of Tessian’s Craft At Speed value.

What Next?

Achieving SOC 2 Type 2 is a crucial step for Tessian as we expand further into the large enterprise space. It’s also the basis on which we will further develop our compliance and risk management initiatives, leading to specialized government security accreditation in the US and Europe over the next year or two.