Data Science Stories & Insights

For us, data science is more than a skill or profession. It is a calling and a way of life. We have a personal passion for trying to solve the previously impossible. We want to share our passion with you. Each week we will share ideas, connect you with the latest topics and trends, and help you start your journey towards a career in data science.

Gartner Data Summit: Collaboration and Communication are Key

By | Booz Allen, Data Science, Kaggle | No Comments

At Booz Allen, we’re committed to making the power of analytics tangible and accessible for a new generation of citizen data scientists. In support of that mission, our Sailfish team headed to Grapevine, Texas, from March 6-9 to attend the Gartner Data & Analytics Summit.

In the data science emerging technologies arena, few gatherings generate as much excitement and attention as Gartner, which hosted 3,000 attendees with backgrounds in analytics, machine learning, data technologies, and data management. And while last year’s Gartner Data Analytics conference was themed on the citizen data scientist and the democratization of data, this year focused more on data science platforms, like Sailfish, that enable and empower those individuals.

As keynote speaker Margaret Heffernan noted, non-analytical factors—such as trust, open communication, and the ability to share insights—are critical for organizations to derive the most value from their data. Socializing data science models and results helps to build a data-driven culture of openness and collaboration across the enterprise. In another presentation, the leader of data analytics at Ford Motor Company validated the same point when he described how his internal analytics team acts in a consulting role to the different divisions of the corporation in order to build trust, value, cross-organization visibility, understanding of data science, and impactful analytic solutions.

At the Booz Allen exhibit booth, the Sailfish team demonstrated the platform’s similar ability to open up lines of communication among otherwise disparate groups. Through its unique social curation capabilities, Sailfish allows users to collect, organize, and catalog data, and then easily share data sets with a network of followers for feedback and refinement.

Data sharing and curation also complement Sailfish’s discovery capabilities, through which users can perform advanced analytics by querying the data sets in plain English. For example, visitors to the booth learned which stoplight in Washington, DC, generates the most traffic infraction tickets, which public universities deliver the best bargain and quality, and whether there’s a correlation between SAT scores and retention at universities. (Hint: There is, and it’s strong.)

#Data4Good Twitter Chat

To amplify the conference conversation, on March 8, Sailfish Product Manager Seth Clark (@caradoxical) and Booz Allen Principal Data Scientist Kirk Borne (@KirkDBorne) also hosted a Twitter chat focused on the unique capability of data science platforms to enable #Data4Good activities, both internal and external to organizations.

Queried on how to spread the benefits of data science to as many people as possible, participants in the Twitter chat noted that data science competitions (such as the Data Science Bowl) enable data scientists to solve some of the world’s most challenging problems. They also discussed how easing access to data through platforms like Sailfish can empower exploration, discovery, and innovation.

On the question of how to broaden the dissemination of data science results, participant @Prashant_1722 suggested that public-private partnerships can be useful for vetting and using analytics results for societal good, while @knowlengr added that mobile-optimized access can help bridge the digital divide between those with and without computer access.

Among other topics, participants also discussed where data scientists can find resources to guide them in the ethical use and applications of artificial intelligence (see the Asilomar AI principles), as well as how leaders in an organization can ensure that data science benefits everyone equally (in the end, it is the responsibility of all of us who are involved). Search for #DataSciChat on Twitter to see the entire conversation.

And even if you didn’t attend the Gartner Summit, you can still learn how to tame your data and how to derive significant insights and value from it. Visit BoozAllen.com/Sailfish to get started today.

Data Science News: AI Detection of Skin Cancer

By | Booz Allen, Data Science, Kaggle | No Comments

As the third Data Science Bowl focuses on early detection of lung cancer using patient scans, researchers at Stanford University recently developed an algorithm to detect skin cancer from photographs. This algorithm, which matches the performance of dermatologists when diagnosing skin lesions, could have a significant effect on remote diagnosis, especially given the team’s goal of adapting it for use on a smartphone. Read more at http://news.stanford.edu/2017/01/25/artificial-intelligence-used-identify-skin-cancer/

I Know and Understand Cancer, All Too Well: Help Join the Fight

By | Booz Allen, Data Science, Kaggle | No Comments

My life has been surrounded by this disease for the past 20 years. In 1997, my brother was diagnosed with a rare type of cancer called renal medullary carcinoma. He passed away at 35 years old, leaving behind a wife and two sons ages 3 and 5. They will never know how wonderful their father was. To this day, there is no cure for this cancer. Read More

2017 #DataSciChat: Tips, Techniques, and that “Ah-Ha!” Moment

By | Booz Allen, Data Science, Kaggle | No Comments


Graphic by Marc Smith, Director, Social Media Research Foundation

The Data Science Bowl is dedicated to using the power of data to solve the world’s most difficult challenges. But we know that uncovering breakthroughs is impossible alone.

In that spirit, on Thursday, January 26, the Data Science Bowl hosted a Twitter chat for the community of problem-solvers dedicated to thinking bigger, asking the right questions, and this year, ending lung cancer as we know it. Read More

Data Analysis on Aviation Accidents

By | Booz Allen, Data Science, Kaggle | No Comments

Hey there! My name is Katherine Larson and I joined as a Data Scientist in July 2016, though I had been interning with the firm since 2014. Since my first internship with Booz Allen, it’s been embedded in my head that data is the key to everything. All the trends in the data hold meaning, but it’s up to us to discover what that meaning is through data science techniques. Read More

To Some It’s a Competition; To Me It’s Personal

By | Booz Allen, Data Science, Kaggle | No Comments

Vegetarians don’t understand what I am about to tell you. I know they like to tell you that veggie-burgers can be just as good; but anyone with a true addiction to the great North American bovine knows it is simply false. So here it goes: my father has not had a cheeseburger in 18 months. On the law of averages in this country that would make him a carnivorous outlier. But Bernie is no ordinary carnivore. Dad is a man who enjoys his burgers so much that a table of raucous companions would come to silence on the rare occasion he would order any other dish at a restaurant. But he has not had a burger in 18 months. The sad fact is that cancer not only takes the people we love, it can also take a way of life. Read More

Turning Machine Intelligence Against Cancer

By | Booz Allen, Data Science, Kaggle | No Comments

In the U.S., cancer will strike two in every five people in their lifetimes. But it affects all of us.

That’s why, in 2015, the office of the Vice President announced the Cancer Moonshot. It’s an audacious effort to make a decade’s worth of progress in cancer prevention, diagnosis, and treatment in just five years.

Beginning today, the 2017 Data Science Bowl will pursue one of the Cancer Moonshot’s key goals: unleashing the power of data against this deadly disease. Presented by Booz Allen and Kaggle, the competition will convene the data science and medical communities to develop cancer detection algorithms, and help end the disease as we know it. Read More

How Data Science Can Help Cure Cancer

By | Booz Allen, Data Science | No Comments

I will never forget that call.

“Kelly has cancer,” my dad said softly.

Knees weak, I sat down on the bed. I didn’t know if my sister was going to live. And, despite us having spent decades doing everything together, she’d have to fight this battle on her own. I’m not the only one who’s heard that kind of call. The moment I experienced was not singular to me; it is one that is repeated over 12.7 million times each year – with over half of those diagnosed ultimately not surviving. Read More

Winning the 2nd Annual Data Science Bowl: Hedge Funds to Heart Disease

By | Booz Allen, Data Science | No Comments

Tencia Lee, a Math graduate and hedge fund trader, partnered with Qi Liu, a PhD in Physics also with a hedge fund background, to devise the winning algorithm in this year’s Data Science Bowl. They spent more than 100 hours each in evenings and on weekends building and testing algorithms. Working in parallel, Lee and Liu built and trialled hundreds of algorithms to read the heart scans. Their efforts paid off with the largest prize in the competition, won from a field of 993 data scientist contestants in the Data Science Bowl. In this blog, Tencia Lee reveals the work behind the win. Read More

Leading and Winning Team Submissions Analysis

By | Booz Allen, Data Science | No Comments

Can we determine clinical applicability?

This year’s competition was intended to catalyze a change in cardiac diagnostics, so connecting the competition participants and the medical community is an essential part of the Data Science Bowl (DSB). I have done some preliminary analysis of the DSB’s top 4 team submissions. The goal is to present the results in terms that are meaningful to the medical research community. In doing so I hope to spark a dialog between the communities. Read More

Segmentation and LV localization Based Approaches

By | Booz Allen, NVIDIA | No Comments

In our last blog post we described an end-to-end deep learning solution to this challenge. By “end-to-end” we mean that the raw pixels constituting a SAX study for an individual patient were fed into a convolutional neural network (ConvNet) and predicted left ventricle (LV) systolic and diastolic CDFs came out the other end – the only other processing that took place was the zero mean unit variance (ZMUV) pre-processing of the images. Whilst this approach to the problem is elegant in its simplicity, it is also a very challenging function for a neural network to learn. This is because there is no explicit training signal for the area of the left ventricle that should be measured from each image, just the whole volume for the SAX study. Read More
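For readers unfamiliar with ZMUV pre-processing, here is a minimal sketch of the idea in Python. It assumes normalization is applied per image over a flat list of pixel intensities; the teaser does not say whether the original solution normalized per image or across the whole dataset, and the function name `zmuv` and the sample values are purely illustrative.

```python
from statistics import fmean, pstdev


def zmuv(pixels):
    """Rescale a flat list of pixel intensities to zero mean and unit variance."""
    mean = fmean(pixels)
    std = pstdev(pixels, mu=mean)
    if std == 0:
        # A constant image has no contrast; return zeros instead of dividing by zero.
        return [0.0 for _ in pixels]
    return [(p - mean) / std for p in pixels]


# Hypothetical intensities standing in for one flattened image.
image = [12, 40, 40, 200, 255, 90]
normalized = zmuv(image)
```

After this step, images that were recorded with different intensity scales share a common range, which is usually the point of applying it before a ConvNet.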

Building and Working on a Dispersed Team

By | Booz Allen, NVIDIA | No Comments

This year is the first time that Booz Allen and NVIDIA have partnered to enter a team into the Data Science Bowl. Our goal for this combined team was to share some of our successes and challenges along the way, as well as to provide insight into how to approach this type of competition. We’ve been able to post updates about our progress, respond to questions on the Kaggle forums, and help other teams find new ways of looking at the problem. Of course, we’re also hoping that by combining our talent and resources we will be able to come up with a top solution – even if we’re not eligible for the prize money. Read More

Intro Guide to AWS

By | Booz Allen, Data Science | No Comments

This guide will walk you through using spot instances with Amazon Web Services (AWS) to help you save money when training DSB models on MXNet. A spot instance on AWS is a virtual machine hosted on the Amazon cloud that you bid for. If you are outbid, the instance is terminated and all data associated with that instance is lost. Certain steps may require an external search, for example on Google or Bing. For instance, this guide does not cover setting up an AWS account. We assume you already have one, and we start from there. Read More

Image Preprocessing: The Challenges and Approach

By | Booz Allen, NVIDIA | No Comments

The dataset for the 2016 Data Science Bowl presents several challenges for automated exploitation. As the images were collected in a real world setting, with several types of sensors, there is a great deal of variation from patient to patient with respect to image orientation, pixel spacing, and intensity scaling. All of these factors should be dealt with as part of any competitive solution; while they may not be required for a good solution, a winning design requires every last bit of information to be squeezed out of the data. Read More

Informatics: The End of Demographics with Deep, Wide, Fast Data

By | Booz Allen, Data Science | No Comments

Years ago, when I was working as a manager in NASA’s Astrophysics Data Facility, we curated data sets from thousands of NASA space science experiments. Each of those data sets was relatively small (by today’s “big data” standards), and each was usually focused on some limited science problem, with a limited number of observed features, for a limited sample size, within a limited domain of study. The data were useful to address specific questions and specific problems.

Read More

Quantum Computing and the Race for Better Analytics

By | Booz Allen, Data Science | No Comments

The battle is set: on one side stands data – ever growing, ever more important; on the other stands analytics technology – also continuously gaining speed and capabilities. We, as machine learning and data analytics enthusiasts, want nothing more than to see the “tech” side winning this battle. But, as our datasets and problems continue to grow larger and larger, our tools to analyze and solve them must grow in stride, lest we let the untapped power of the data go to waste. It is like a twisted, data version of Frankenstein: our own creations, like the internet of things (IoT), are producing vast quantities of data that we can’t properly deal with. The waste and opportunity cost from unanalyzed data is out of control! Read More

5 Awesome Problems Solved Through Data Science

By | Booz Allen, Data Science | No Comments

Booz Allen does not just have a data science team. Yes, we are proud of our industry-leading, 600+ member group of data scientists; but that team is not evidence of our firm simply checking a box in the technology market. Our data science capabilities, in contrast, are indicative of our diagnostic fascination with finding new, better ways of answering our world’s oldest questions. Read More

CRPS and Its Implications

By | Booz Allen, Data Science | No Comments

The 2015/2016 Data Science Bowl is scored using a relatively little-known statistic, the Continuous Ranked Probability Score (CRPS). A detailed mathematical explanation of CRPS is available here and on the Data Science Bowl Kaggle evaluation page. It’s difficult to conceptualize the meaning of a specific CRPS, especially since the score can often appear “low” as its value nears zero. Still, the score has meaningful implications for the utility of your algorithm in a clinical setting.

Read More
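To make the “low” scores easier to picture, here is a small Python sketch of a discrete CRPS. It assumes the score is the mean squared gap between a predicted cumulative distribution and a step function at the true value, evaluated on a grid of 600 thresholds (0 to 599); that grid size, the function name `crps`, and the sample volume are assumptions for illustration, so check the evaluation page for the official definition.

```python
def crps(predicted_cdfs, true_values, n_bins=600):
    """Mean squared gap between each predicted CDF and the step at the true value."""
    total = 0.0
    for cdf, value in zip(predicted_cdfs, true_values):
        if len(cdf) != n_bins:
            raise ValueError("each CDF needs one probability per threshold")
        for n, p in enumerate(cdf):
            step = 1.0 if n >= value else 0.0  # Heaviside step H(n - value)
            total += (p - step) ** 2
    return total / (n_bins * len(true_values))


true_volume = 120  # hypothetical volume
perfect = [1.0 if n >= true_volume else 0.0 for n in range(600)]
off_by_30 = [1.0 if n >= true_volume + 30 else 0.0 for n in range(600)]

print(crps([perfect], [true_volume]))    # 0.0
print(crps([off_by_30], [true_volume]))  # 30 mismatched thresholds out of 600
```

Under these assumptions, a fully confident prediction that misses by 30 units disagrees with the truth on 30 of 600 thresholds and scores 30/600 = 0.05. A score that looks close to zero can therefore still hide an error that would matter in a clinical setting.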

The Power of Analogies

By | Booz Allen, Data Science | No Comments

Practitioners and data scientists have developed their own jargon, such that communication and collaboration can prove difficult across domains. For example, doctors might find it difficult to communicate to data scientists why some data (e.g., shape and structural organization of a tumor in a Magnetic Resonance Imaging scan) are especially important for a given diagnosis (metastatic potential of the tumor), and how this can be reflected in the data structure. Likewise, data scientists might struggle to explain to physicians how or why a given analytical tool (e.g., Bayesian networks) might be effective for uncovering useful information in patient records (changes in prescription medicine use over time as a predictor of future illness). The problem is only compounded when insurance companies, patient advocates, regulatory agencies, and other stakeholders weigh in. Read More

Telling Your Data’s Stories

By | Booz Allen, Data Science | No Comments

As data scientists, we look for stories within data. We use math, statistics, programming, and learning algorithms to uncover these stories. We love to discuss our explorations into data with those who will listen, but because of the esoteric nature of our work, our discoveries may not be widely heard or understood. To engage an audience, we should be great at visualizing and telling these stories.

Read More

At the Heart of Data Science

By | Booz Allen, Data Science | No Comments

Every 24 hours, your heart beats approximately 100,000 times. No matter whether it was a day of triumph or defeat, discovery or pursuit, love or loss, you can count on roughly 100,000 reminders that you have embarked, once again, upon another chapter of the best story ever told – yours. Over a lifetime, that remarkable force will take 2.5 billion steps toward the future. So, though our hearts may, at times, yearn for the past or a present that never was, their beating drives us ever forward. And that is a journey that must be protected at all costs. Read More

Useful Applications of Simulation Modeling

By | Booz Allen, Data Science | No Comments

Simulation Modeling is a structured approach to discovering key variable relationships within a system. Systems take on many forms across sectors, from agriculture to aerospace and defense to zoology. These systems are generally finite and operate within a set of defined business rules, often forcing decision makers to make difficult tradeoffs that can result in a range of profitable, or costly, outcomes. Read More

Data Science Image Learning

By | Booz Allen, Data Science | No Comments

With the next Data Science Bowl just around the corner, I set out to prepare myself for the competition. The truth: I’m not a coder.

I have an interest in data science. I appreciate the process—and the results. I’m open to advice from the best of the best. With that in mind, I set out to find the top tips and tricks buried within last year’s competition forums. What I found is a treasure trove for anyone who is going to participate, from beginner to seasoned pro. Read More

Democratizing Data Science

By | Booz Allen, Data Science | No Comments

Data science is often regarded as an elite field. It draws experts with advanced degrees in math and science, fluency in multiple programming languages, and a firm grasp of statistics and probability theory. To reach the top of a Kaggle challenge like the Data Science Bowl is to demonstrate a feat of technical wizardry. Read More

Focusing On Diversity

By | Booz Allen, Data Science | No Comments

Data Science is powerful. By combining the fields of statistics and computer science it allows us to analyze and understand data and make that data understandable to others. This means Data Scientists can direct the public’s sightlines to particular trends or information. One particular trend in data science, and STEM in general, is worth mentioning: despite the field’s growth, only a small fraction of the STEM workforce consists of minority groups. Read More

Tools for the Data Scientist

By | Data Science | No Comments

Modern computing has no shortage of tools for the data scientist. The open source community alters the landscape every six to 12 months, and competition keeps you on the bleeding edge. In my career as a data scientist, I use everything from scientific Python™ packages to the newest cloud computing architectures—and sometimes all within the same project, as the initial stages of data exploration and mining are often done in a different language than the final product implementation. Read More

Overfitting

By | Data Science | No Comments

Overfitting is an issue within machine learning and statistics. It occurs when we build models that closely explain a training data set, but fail to generalize when applied to other data sets. Overfitting is a part of life as a data scientist. We all do it to some degree or another. In the case of forecasting in data science competitions, it might actually be advantageous to overfit to Kaggle’s public leaderboard. However, if you have an independent and identically distributed (iid) split between train and test data sets, then it’s probably better to come up with a leak-free cross-validation (CV) scheme. Read More
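One common source of leakage is letting related samples, such as several images from the same patient, land on both sides of a split. The Python sketch below shows one possible way to avoid that by assigning whole groups to folds. It is an illustration only: the teaser does not say which CV scheme the author has in mind, and the function name `group_k_fold` and the patient IDs are hypothetical.

```python
def group_k_fold(groups, k):
    """Yield (train_idx, valid_idx) pairs in which no group appears on both sides."""
    unique_groups = sorted(set(groups))
    if k < 2 or k > len(unique_groups):
        raise ValueError("k must be between 2 and the number of distinct groups")
    # Assign each whole group to exactly one fold, round-robin.
    fold_of = {g: i % k for i, g in enumerate(unique_groups)}
    for fold in range(k):
        valid = [i for i, g in enumerate(groups) if fold_of[g] == fold]
        train = [i for i, g in enumerate(groups) if fold_of[g] != fold]
        yield train, valid


# Hypothetical patient ID for each sample in a data set.
patients = ["a", "a", "b", "b", "c", "c", "d"]
for train_idx, valid_idx in group_k_fold(patients, k=2):
    print("train:", train_idx, "valid:", valid_idx)
```

Because every sample from a given patient stays on one side of each split, the validation score is not inflated by the model having already seen that patient during training.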

Our Journey to an Analytics Driven Culture

By | Data Science | No Comments

There has been a lot of news coverage lately around the topic of creating a data-driven culture within an organization. The fact of the matter is a data-driven culture is crippling. We tried to create a data-driven culture too, but ultimately found that our real transformation came by using data as inputs into a real analytics-driven culture. A culture that values true experimentation, understands failure is the price of discovery, and actually makes use of analytic outputs for decision-making. Read More

Support Vector Machines in Data Science

By | Data Science | No Comments

Support Vector Machines (SVMs) may not be as popular as Neural Networks within data science, but they act as powerful, useful algorithms. One of the difficulties of SVMs has been the computational effort required to train them. However, LIBSVM, which has been used for over a decade, can fairly easily handle the 30,000 training points in the National Data Science Bowl competition’s data set. That makes SVMs a viable tool for you to use both in general, and for the purpose of competing in the Data Science Bowl. Read More

Booz Allen Hamilton and Kaggle Data Scientists Visit the Hatfield Marine Science Center

By | Data Science | No Comments

We were incredibly excited to finally meet our data science partners from Booz Allen and Kaggle face-to-face during their site visit on January 15-16, 2015. We had communicated for months prior to their visit, discussing data, analytics, the number of plankton classes for the data science competition, and the general ins and outs of the Data Science Bowl.

Read More

3 Methods for Feature Creation and Data Transformation in Data Science

By | Data Science | No Comments

Building on Paul Yacci’s earlier post on the importance of feature selection in data science and data analysis, the creation of new features from your existing data set can play a large role in the performance of your model in data science. There are multiple methods of feature creation and data transformation. Often, finding the right transformation of your data can reveal relationships that would be difficult to see otherwise, and may also make it easier for your model to separate classes. Read More
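As a small illustration of how a transformation can make classes easier to separate, the Python sketch below converts points from Cartesian to polar coordinates. Two classes lying on concentric circles cannot be split by a straight line in (x, y), but a single threshold on the radius separates them. The example is ours and is not necessarily one of the three methods covered in the post; the sample points and the threshold of 2.0 are made up.

```python
import math


def to_polar(x, y):
    """Return (radius, angle) for a point given in Cartesian coordinates."""
    return math.hypot(x, y), math.atan2(y, x)


# Hypothetical classes: points on a circle of radius 1 and on a circle of radius 3.
inner = [(math.cos(t), math.sin(t)) for t in (0.0, 1.5, 3.0, 4.5)]
outer = [(3 * math.cos(t), 3 * math.sin(t)) for t in (0.5, 2.0, 3.5, 5.0)]

inner_r = [to_polar(x, y)[0] for x, y in inner]
outer_r = [to_polar(x, y)[0] for x, y in outer]

# In the new radius feature, one threshold separates the two classes.
print(max(inner_r) < 2.0 < min(outer_r))  # True
```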

Feature Selection

By | Data Science | No Comments

Feature Selection is crucial to any model construction in data science. Focusing on the most important, relevant features will help any data scientist design a better model and accelerate outcomes.

So what exactly is a feature in data science analysis? Let’s start there. Read More

Going Beyond in Data Science

By | Data Science | No Comments

Being a data scientist is more than having a technical background; it’s also about going beyond your tools and understanding what it really means to tackle complex data analysis problems. No matter if you are a seasoned big data expert or are just considering moving into the field, here are five things you ought to know. Read More

Become a Data Scientist

Interested in becoming a data scientist? Or changing to a new challenge? Check out career opportunities waiting now at Booz Allen Hamilton.