Data Science Stories & Insights

For us, data science is more than a skill or profession. It is a calling and a way of life. We have a personal passion for trying to solve the previously impossible. We want to share our passion with you. Each week we will share ideas, connect you with the latest topics and trends, and help you start your journey towards a career in data science.

Data Analysis on Aviation Accidents

By | Booz Allen, Data Science, Kaggle | No Comments

Hey there! My name is Katherine Larson and I joined as a Data Scientist in July 2016, though I had been interning with the firm since 2014. Since my first internship with Booz Allen, it’s been embedded in my head that data is the key to everything. All the trends in the data hold meaning, but it’s up to us to discover what that meaning is through data science techniques. Read More

To Some It’s a Competition; To Me It’s Personal

By | Booz Allen, Data Science, Kaggle | No Comments

Vegetarians don’t understand what I am about to tell you. I know they like to tell you that veggie-burgers can be just as good, but anyone with a true addiction to the great North American bovine knows it is simply false. So here it goes: my father has not had a cheeseburger in 18 months. On the law of averages in this country that would make him a carnivorous outlier. But Bernie is no ordinary carnivore. Dad is a man who enjoys his burgers so much that a table of raucous companions would come to silence on the rare occasion he would order any other dish at a restaurant. But he has not had a burger in 18 months. The sad fact is that cancer not only takes the people we love, it can also take a way of life. Read More

Turning Machine Intelligence Against Cancer

By | Booz Allen, Data Science, Kaggle | No Comments
In the U.S., cancer will strike two in every five people in their lifetimes. But it affects all of us.

That’s why, in 2015, the office of the Vice President announced the Cancer Moonshot. It’s an audacious effort to make a decade’s worth of progress in cancer prevention, diagnosis, and treatment in just five years.

Beginning today, the 2017 Data Science Bowl will pursue one of the Cancer Moonshot’s key goals: unleashing the power of data against this deadly disease. Presented by Booz Allen and Kaggle, the competition will convene the data science and medical communities to develop cancer detection algorithms, and help end the disease as we know it. Read More

How Data Science Can Help Cure Cancer

By | Booz Allen, Data Science | No Comments

I will never forget that call.

“Kelly has cancer,” my dad said softly.

Knees weak, I sat down on the bed. I didn’t know if my sister was going to live. And, despite us having spent decades doing everything together, she’d have to fight this battle on her own. I’m not the only one who’s heard that kind of call. The moment I experienced was not singular to me; it is one that is repeated over 12.7 million times each year – with over half of those ultimately not surviving. Read More

Winning the 2nd Annual Data Science Bowl: Hedge Funds to Heart Disease

By | Booz Allen, Data Science | No Comments

Tencia Lee, a Math graduate and hedge fund trader, partnered with Qi Liu, a PhD in Physics also with a hedge fund background, to devise the winning algorithm in this year’s Data Science Bowl. They spent more than 100 hours each in evenings and on weekends building and testing algorithms. Working in parallel, Lee and Liu built and trialled hundreds of algorithms to read the heart scans. Their efforts paid off with the largest prize in the competition, among a field of 993 data scientist contestants. In this blog, Tencia Lee reveals the work behind the win. Read More

Leading and Winning Team Submissions Analysis

By | Booz Allen, Data Science | No Comments

Can we determine clinical applicability?

This year’s competition was intended to catalyze a change in cardiac diagnostics, so connecting the competition participants and the medical community is an essential part of the Data Science Bowl (DSB). I have done some preliminary analysis of the DSB’s top 4 team submissions. The goal is to present the results in terms that are meaningful to the medical research community. In doing so I hope to spark a dialog between the communities. Read More

Segmentation and LV localization Based Approaches

By | Booz Allen, NVIDIA | No Comments
In our last blog post we described an end-to-end deep learning solution to this challenge. By “end-to-end” we mean that the raw pixels constituting a SAX study for an individual patient were fed into a convolutional neural network (ConvNet) and predicted left ventricle (LV) systolic and diastolic CDFs came out the other end – the only other processing that took place was the zero mean unit variance (ZMUV) pre-processing of the images. Whilst this approach to the problem is elegant in its simplicity, it is also a very challenging function for a neural network to learn. This is because there is no explicit training signal for the area of the left ventricle that should be measured from each image, just the whole volume for the SAX study. Read More
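
As a rough illustration of the ZMUV step mentioned above, here is a minimal sketch in plain Python. It assumes the statistics are computed per image over a flat list of pixel intensities; the excerpt does not say whether the team normalized per image or across the whole dataset, and the `zmuv` helper and the toy pixel values are hypothetical.

```python
from statistics import fmean, pstdev

def zmuv(pixels, eps=1e-8):
    """Scale a flat list of pixel intensities to zero mean and unit variance."""
    mean = fmean(pixels)
    std = pstdev(pixels, mu=mean)
    # eps guards against division by zero for a constant image
    return [(p - mean) / (std + eps) for p in pixels]

image = [0.0, 50.0, 100.0, 150.0, 200.0]  # toy stand-in for one image's pixels
normalized = zmuv(image)
```

After this step every image has comparable intensity statistics, so the network does not have to learn the scanner-specific brightness range on its own.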

Building and Working on a Dispersed Team

By | Booz Allen, NVIDIA | No Comments

This year is the first time that Booz Allen and NVIDIA have partnered to enter a team into the Data Science Bowl. Our goal for this combined team was to share some of our successes and challenges along the way, as well as to provide insight into how to approach this type of competition. We’ve been able to post updates about our progress, respond to questions on the Kaggle forums, and help other teams find new ways of looking at the problem. Of course, we’re also hoping that by combining our talent and resources we will be able to come up with a top solution – even if we’re not eligible for the prize money. Read More

Intro Guide to AWS

By | Booz Allen, Data Science | No Comments

This guide will walk you through using spot instances with Amazon Web Services (AWS) to help you save money when training DSB models on MXNet. A spot instance on AWS is a virtual machine hosted on the Amazon cloud that you bid for. If you are outbid, the instance is terminated and all data associated with that instance is lost. Certain steps may require an external search, for example on Google or Bing. For instance, this guide does not cover setup of an AWS account. We assume you have an AWS account, and we start from there. Read More

Image Preprocessing: The Challenges and Approach

By | Booz Allen, NVIDIA | No Comments
The dataset for the 2016 Data Science Bowl presents several challenges for automated exploitation. As the images were collected in a real world setting, with several types of sensors, there is a great deal of variation from patient to patient with respect to image orientation, pixel spacing, and intensity scaling. All of these factors should be dealt with as part of any competitive solution; while they may not be required for a good solution, a winning design requires every last bit of information to be squeezed out of the data. Read More

Informatics: The End of Demographics with Deep, Wide, Fast Data

By | Booz Allen, Data Science | No Comments

Years ago, when I was working as a manager in NASA’s Astrophysics Data Facility, we curated data sets from thousands of NASA space science experiments. Each of those data sets was relatively small (by today’s “big data” standards), and each was usually focused on some limited science problem, with a limited number of observed features, for a limited sample size, within a limited domain of study. The data were useful to address specific questions and specific problems.

Read More

Quantum Computing and the Race for Better Analytics

By | Booz Allen, Data Science | No Comments

The battle is set: on one side stands data – ever growing, ever more important; on the other stands analytics technology – also continuously gaining speed and capabilities. We, as machine learning and data analytics enthusiasts, want nothing more than to see the “tech” side winning this battle. But, as our datasets and problems continue to grow larger and larger, our tools to analyze and solve them must grow in stride, lest we let the untapped power of the data go to waste. It is like a twisted, data version of Frankenstein: our own creations, like the internet of things (IoT), are producing vast quantities of data that we can’t properly deal with. The waste and opportunity cost from unanalyzed data is out of control! Read More

5 Awesome Problems Solved Through Data Science

By | Booz Allen, Data Science | No Comments

Booz Allen does not just have a data science team. Yes, we are proud of our industry-leading, 600+ member group of data scientists, but that team is not evidence of our firm simply checking a box in the technology market. Our data science capabilities, in contrast, are indicative of our diagnostic fascination with finding new, better ways of answering our world’s oldest questions. Read More

CRPS and Its Implications

By | Booz Allen, Data Science | No Comments

The 2015/2016 Data Science Bowl is scored using a relatively little-known statistic, the Continuous Ranked Probability Score (CRPS). A detailed mathematical explanation of CRPS is available here and on the Data Science Bowl Kaggle evaluation page. It’s difficult to conceptualize the meaning of a specific CRPS, especially since the score can often appear “low” as its value nears zero. Still, the score has meaningful implications for the utility of your algorithm in a clinical setting.

Read More
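
To make the score a little more concrete, here is a minimal sketch of a discrete CRPS in plain Python. It assumes the form in which each prediction is a cumulative distribution evaluated at 600 one-millilitre thresholds (0 to 599 mL) and is compared against a step function at the true volume; that threshold count is an assumption here and should be checked against the Kaggle evaluation page. The `crps` function and the toy predictions are hypothetical.

```python
def crps(predicted_cdfs, true_volumes, n_thresholds=600):
    """Mean squared gap between each predicted CDF and the step function at the true value."""
    total = 0.0
    for cdf, volume in zip(predicted_cdfs, true_volumes):
        for n in range(n_thresholds):
            step = 1.0 if n >= volume else 0.0  # Heaviside step H(n - V)
            total += (cdf[n] - step) ** 2
    return total / (n_thresholds * len(true_volumes))

# A perfectly confident, correct prediction for a true volume of 120 mL
perfect = [1.0 if n >= 120 else 0.0 for n in range(600)]
# A prediction that says "50% chance" at every threshold
uninformed = [0.5] * 600

perfect_score = crps([perfect], [120])
uninformed_score = crps([uninformed], [120])
```

Under these assumptions the perfect prediction scores 0 and the uninformed one scores 0.25, which shows why even mediocre submissions can look "low": the score is an average of squared probability gaps, each of which is at most 1.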

The Power of Analogies

By | Booz Allen, Data Science | No Comments

Practitioners and data scientists have developed their own jargon, such that communication and collaboration can prove difficult across domains. For example, doctors might find it difficult to communicate to data scientists why some data (e.g., shape and structural organization of a tumor in a Magnetic Resonance Imaging scan) are especially important for a given diagnosis (metastatic potential of the tumor), and how this can be reflected in the data structure. Likewise, data scientists might struggle to explain to physicians how or why a given analytical tool (e.g., Bayesian networks) might be effective for uncovering useful information in patient records (changes in prescription medicine use over time as a predictor of future illness). The problem is only compounded when insurance companies, patient advocates, regulatory agencies, and other stakeholders weigh in. Read More

Telling Your Data’s Stories

By | Booz Allen, Data Science | No Comments

As data scientists, we look for stories within data. We use math, statistics, programming, and learning algorithms to uncover these stories. We love to discuss our explorations into data with those who will listen, but because of the esoteric nature of our work, our discoveries may not be widely heard or understood. To engage an audience, we should be great at visualizing and telling these stories.

Read More

At the Heart of Data Science

By | Booz Allen, Data Science | No Comments

Every 24 hours, your heart beats approximately 100,000 times. No matter whether it was a day of triumph or defeat, discovery or pursuit, love or loss, you can count on roughly 100,000 reminders that you have embarked, once again, upon another chapter of the best story ever told – yours. Over a lifetime, that remarkable force will take 2.5 billion steps toward the future. So, though our hearts may, at times, yearn for the past or a present that never was, their beating drives us ever forward. And that is a journey that must be protected at all costs. Read More

Useful Applications of Simulation Modeling

By | Booz Allen, Data Science | No Comments

Simulation Modeling is a structured approach to discovering key variable relationships within a system. Systems take on many forms across sectors, from agriculture to aerospace and defense to zoology. These systems are generally finite and operate within a set of defined business rules, often forcing decision makers to make difficult tradeoffs that can result in a range of profitable, or costly, outcomes. Read More

Data Science Image Learning

By | Booz Allen, Data Science | No Comments

With the next Data Science Bowl just around the corner, I set out to prepare myself for the competition. The truth: I’m not a coder.

I have an interest in data science. I appreciate the process—and the results. I’m open to advice from the best of the best. With that in mind, I set out to find the top tips and tricks buried within last year’s competition forums. What I found is a treasure trove for anyone who is going to participate, from beginner to seasoned pro. Read More

Democratizing Data Science

By | Booz Allen, Data Science | No Comments

Data science is often regarded as an elite field. It draws experts with advanced degrees in math and science, fluency in multiple programming languages, and a firm grasp of statistics and probability theory. To reach the top of a Kaggle challenge like the Data Science Bowl is to demonstrate a feat of technical wizardry. Read More

Focusing On Diversity

By | Booz Allen, Data Science | No Comments

Data Science is powerful. By combining the fields of statistics and computer science it allows us to analyze and understand data and make that data understandable to others. This means Data Scientists can direct the public’s sightlines to particular trends or information. One particular trend in data science, and STEM in general, is worth mentioning: despite the field’s growth, only a small fraction of the STEM workforce consists of minority groups. Read More

Tools for the Data Scientist

By | Data Science | No Comments

Modern computing has no shortage of tools for the data scientist. The open source community alters the landscape every six to 12 months, and competition keeps you on the bleeding edge. In my career as a data scientist, I use everything from scientific Python™ packages to the newest cloud computing architectures—and sometimes all within the same project, as the initial stages of data exploration and mining are often done in a different language than the final product implementation. Read More

Overfitting

By | Data Science | No Comments

Overfitting is an issue within machine learning and statistics. It occurs when we build models that closely explain a training data set, but fail to generalize when applied to other data sets. Overfitting is a part of life as a data scientist. We all do it to some degree or another. In the case of forecasting in data science competitions, it might actually be advantageous to overfit to Kaggle’s public leaderboard. However, if you have an independent, and identically distributed (iid), split between train and test data sets, then it’s probably better to come up with a leak-free cross-validation (CV) scheme. Read More
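
As one illustration of what a basic cross-validation split looks like, here is a minimal k-fold sketch in plain Python. It is not the author's scheme; the `k_fold_indices` helper is hypothetical and assumes the samples are independent of one another. If several samples belong to the same subject (for example, many images per patient), a leak-free split would need to assign whole subjects to folds instead of individual samples.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, validation) index lists; each sample is validated exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

splits = list(k_fold_indices(10, k=5))
```

Each model is fit on `train` and scored on `validation`, so no sample is ever used to evaluate a model that was trained on it.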

Our Journey to an Analytics Driven Culture

By | Data Science | No Comments

There has been a lot of news coverage lately around the topic of creating a data-driven culture within an organization. The fact of the matter is a data-driven culture is crippling. We tried to create a data-driven culture too, but ultimately found that our real transformation came by using data as inputs into a real analytics-driven culture: one that values true experimentation, understands failure is the price of discovery, and actually makes use of analytic outputs for decision-making. Read More

Support Vector Machines in Data Science

By | Data Science | No Comments

Support Vector Machines (SVMs) may not be as popular as Neural Networks within data science, but they are powerful, useful algorithms. One of the difficulties of SVMs has been the computational effort required to train them. However, LIBSVM, which has been used for over a decade, can fairly easily handle the 30,000 training points in the National Data Science Bowl competition’s data set. That makes SVMs a viable tool for you to use both in general, and for the purpose of competing in the Data Science Bowl. Read More

Booz Allen Hamilton and Kaggle Data Scientists Visit the Hatfield Marine Science Center

By | Data Science | No Comments

We were incredibly excited to finally meet our data science partners from Booz Allen and Kaggle face-to-face during their site visit on January 15-16, 2015. We had communicated for months prior to their visit, discussing data, analytics, the number of plankton classes for the data science competition, and the general ins and outs of the Data Science Bowl.

Read More

3 Methods for Feature Creation and Data Transformation in Data Science

By | Data Science | No Comments

Building on Paul Yacci’s earlier post on the importance of feature selection in data science and data analysis, the creation of new features from your existing data set can play a large role in the performance of your model in data science. There are multiple methods of feature creation and data transformation. Often, finding the right transformation of your data can reveal relationships that would be difficult to see otherwise, and may also make it easier for your model to separate classes. Read More
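
A small, hypothetical example of a transformation that makes classes easier to separate: two classes arranged as concentric rings cannot be split by a threshold on x or y alone, but a derived "distance from the origin" feature separates them with a single cut. The ring data and helper names below are invented for illustration and are not taken from the post.

```python
import math

def ring(radius, n_points=8):
    """Points evenly spaced on a circle of the given radius around the origin."""
    return [(radius * math.cos(2 * math.pi * i / n_points),
             radius * math.sin(2 * math.pi * i / n_points))
            for i in range(n_points)]

def radius_feature(point):
    """New feature: distance of the point from the origin."""
    x, y = point
    return math.hypot(x, y)

inner = ring(1.0)  # class A
outer = ring(3.0)  # class B

inner_r = [radius_feature(p) for p in inner]
outer_r = [radius_feature(p) for p in outer]
# Any threshold between 1 and 3 on the new feature now separates the two classes.
```

The raw coordinates overlap heavily between the two classes, while the single derived feature does not, which is the kind of relationship a well-chosen transformation can expose.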

Feature Selection

By | Data Science | No Comments

Feature Selection is crucial to any model construction in data science. Focusing on the most important, relevant features will help any data scientist design a better model and accelerate outcomes.

So what exactly is a feature in data science analysis? Let’s start there. Read More

Going Beyond in Data Science

By | Data Science | No Comments

Being a data scientist is more than having a technical background; it’s also about going beyond your tools and understanding what it really means to tackle complex data analysis problems. No matter if you are a seasoned big data expert or are just considering moving into the field, here are five things you ought to know. Read More

Become a Data Scientist

Interested in becoming a data scientist? Or changing to a new challenge? Check out career opportunities waiting now at Booz Allen Hamilton.