Hey there! My name is Katherine Larson and I joined as a Data Scientist in July 2016, though I had been interning with the firm since 2014. Since my first internship with Booz Allen, it’s been embedded in my head that data is the key to everything. All the trends in the data hold meaning, but it’s up to us to discover what that meaning is through data science techniques. Read More
Data Science Stories & Insights
For us, data science is more than a skill or profession. It is a calling and a way of life. We have a personal passion for trying to solve the previously impossible. We want to share our passion with you. Each week we will share ideas, connect you with the latest topics and trends, and help you start your journey towards a career in data science.
Vegetarians don’t understand what I am about to tell you. I know they like to tell you that veggie-burgers can be just as good; but anyone with a true addiction to the great North American bovine knows it is simply false. So here it goes: my father has not had a cheeseburger in 18 months. On the law of averages in this country that would make him a carnivorous outlier. But Bernie is no ordinary carnivore. Dad is a man who enjoys his burgers so much that a table of raucous companions would come to silence on the rare occasion he would order any other dish at a restaurant. But he has not had a burger in 18 months. The sad fact is that cancer not only takes the people we love, it can also take a way of life. Read More
That’s why, in 2015, the office of the Vice President announced the Cancer Moonshot. It’s an audacious effort to make a decade’s worth of progress in cancer prevention, diagnosis, and treatment in just five years.
Beginning today, the 2017 Data Science Bowl will pursue one of the Cancer Moonshot’s key goals: unleashing the power of data against this deadly disease. Presented by Booz Allen and Kaggle, the competition will convene the data science and medical communities to develop cancer detection algorithms, and help end the disease as we know it. Read More
“Kelly has cancer,” my dad said softly.
Knees weak, I sat down on the bed. I didn’t know if my sister was going to live. And, despite us having spent decades doing everything together, she’d have to fight this battle on her own. I’m not the only one who’s heard that kind of call. The moment I experienced was not unique to me; it is one that is repeated over 12.7 million times each year – with over half of those ultimately not surviving. Read More
Tencia Lee, a Math graduate and hedge fund trader, partnered with Qi Liu, a PhD in Physics also with a hedge fund background, to devise the winning algorithm in this year’s Data Science Bowl. They spent more than 100 hours each in evenings and on weekends building and testing algorithms. Working in parallel, Lee and Liu built and trialled hundreds of algorithms to read the heart scans. Their efforts paid off: they took the largest prize in the competition, out of a field of 993 data scientist contestants. In this blog, Tencia Lee reveals the work behind the win. Read More
Can we determine clinical applicability?
This year’s competition was intended to catalyze a change in cardiac diagnostics, so connecting the competition participants and the medical community is an essential part of the Data Science Bowl (DSB). I have done some preliminary analysis of the top 4 DSB team submissions. The goal is to present the results in terms that are meaningful to the medical research community. In doing so I hope to spark a dialog between the communities. Read More
Each day, 1,500 people in the U.S. are diagnosed with heart failure. And yet, despite decades of medical advancements, assessing cardiac function remains a time-consuming undertaking.
Until, potentially, now. Read More
This year is the first time that Booz Allen and NVIDIA have partnered to enter a team into the Data Science Bowl. Our goal for this combined team was to share some of our successes and challenges along the way, as well as to provide insight into how to approach this type of competition. We’ve been able to post updates about our progress, respond to questions on the Kaggle forums, and help other teams find new ways of looking at the problem. Of course, we’re also hoping that by combining our talent and resources we will be able to come up with a top solution – even if we’re not eligible for the prize money. Read More
This guide will walk you through using spot instances with Amazon Web Services (AWS) to help you save money when training DSB models on MXNet. A spot instance on AWS is a virtual machine hosted on the Amazon cloud that you bid for. If you are outbid, the instance is terminated and all data associated with that instance is lost. Certain steps may require an external search using Google or Bing. For instance, this guide does not cover setting up an AWS account. We assume you have one, and we start from there. Read More
Years ago, when I was working as a manager in NASA’s Astrophysics Data Facility, we curated data sets from thousands of NASA space science experiments. Each of those data sets was relatively small (by today’s “big data” standards), and each was usually focused on some limited science problem, with a limited number of observed features, for a limited sample size, within a limited domain of study. The data were useful to address specific questions and specific problems.
The battle is set: on one side stands data – ever growing, ever more important; on the other stands analytics technology – also continuously gaining speed and capabilities. We, as machine learning and data analytics enthusiasts, want nothing more than to see the “tech” side winning this battle. But, as our datasets and problems continue to grow larger and larger, our tools to analyze and solve them must grow in stride, lest we let the untapped power of the data go to waste. It is like a twisted, data version of Frankenstein: our own creations, like the internet of things (IoT), are producing vast quantities of data that we can’t properly deal with. The waste and opportunity cost from unanalyzed data is out of control! Read More
It is no secret that data science can help to solve the most complex and pressing challenges that face today’s business leaders. There is even a growing awareness of analytics’ potential to improve the ways in which governments around the world serve their citizens. Read More
Booz Allen does not just have a data science team. Yes, we are proud of our industry-leading, 600+ member group of data scientists; but that team is not evidence of our firm simply checking a box in the technology market. Our data science capabilities, in contrast, are indicative of our diagnostic fascination with finding new, better ways of answering our world’s oldest questions. Read More
A guest post by Jessica Luo, Ph.D, Marine Biology and Fisheries at the Rosenstiel School of Marine and Atmospheric Science, University of Miami.
The most recent Paris conference on climate change, COP21, underscored how critical it is to understand and manage the impacts of climate change in all aspects of the human and earth ecosystem. Read More
The 2015/2016 Data Science Bowl is scored using a relatively little-known statistic, the Continuous Ranked Probability Score (CRPS). A detailed mathematical explanation of CRPS is available here and on the Data Science Bowl Kaggle evaluation page. It’s difficult to conceptualize the meaning of a specific CRPS, especially since the score can often appear “low” as its value nears zero. Still, the score has meaningful implications for the utility of your algorithm in a clinical setting.
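To make the score a little more concrete, here is a minimal sketch of how a CRPS of this kind might be computed. It assumes the discrete form in which each prediction is a cumulative distribution over fixed, evenly spaced bins and the truth is a step function that jumps from 0 to 1 at the observed value. The function names, the four-bin toy example, and the sample numbers are illustrative assumptions, not taken from the competition.

```python
def step_cdf(true_value, n_bins):
    """Ideal CDF for a known outcome: 0 below the true value, 1 at or above it."""
    return [1.0 if n >= true_value else 0.0 for n in range(n_bins)]


def crps(predicted_cdfs, true_values):
    """Mean squared gap between predicted CDFs and the ideal step CDFs.

    predicted_cdfs: list of lists, each giving P(y <= n) for n = 0..n_bins-1
    true_values:    list of observed values, one per prediction
    """
    n_bins = len(predicted_cdfs[0])
    total = 0.0
    for cdf, true_value in zip(predicted_cdfs, true_values):
        ideal = step_cdf(true_value, n_bins)
        total += sum((p - h) ** 2 for p, h in zip(cdf, ideal))
    # Average over every bin of every case.
    return total / (n_bins * len(true_values))


# Toy example with 4 bins (hypothetical numbers).
confident_and_right = crps([[0.0, 0.0, 1.0, 1.0]], [2])
slightly_hedged = crps([[0.0, 0.5, 1.0, 1.0]], [2])
```

In this sketch a perfect, fully confident prediction scores zero, and every bit of probability placed on the wrong side of the true value adds to the score. Because the sum is averaged over all bins, even a noticeably imperfect prediction can produce a number that looks small.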
Practitioners and data scientists have developed their own jargon, such that communication and collaboration can prove difficult across domains. For example, doctors might find it difficult to communicate to data scientists why some data (e.g., shape and structural organization of a tumor in a Magnetic Resonance Imaging scan) are especially important for a given diagnosis (metastatic potential of the tumor), and how this can be reflected in the data structure. Likewise, data scientists might struggle to explain to physicians how or why a given analytical tool (e.g., Bayesian networks) might be effective for uncovering useful information in patient records (changes in prescription medicine use over time as a predictor of future illness). The problem is only compounded when insurance companies, patient advocates, regulatory agencies, and other stakeholders weigh in. Read More
Thanks to all who participated in our December 2015 #DataSciChat, and @KirkBorne for hosting. The data science community was well represented, with participants from around the globe ranging from newbies to distinguished experts interacting with panelists from Booz Allen, Kaggle, the Society of Women Engineers (SWE) and the American College of Cardiology (ACC). Read More
As data scientists, we look for stories within data. We use math, statistics, programming, and learning algorithms to uncover these stories. We love to discuss our explorations into data with those who will listen, but because of the esoteric nature of our work, our discoveries may not be widely heard or understood. To engage an audience, we should be great at visualizing and telling these stories.
Dr. Brené Brown is the author of The Gifts of Imperfection. Her work uses qualitative research to explore the human connection. She dubs her qualitative research stories as “just data with a soul.” Her work references the descriptive rather than the numeric aspect of qualitative knowledge. She discusses the power of vulnerability in her TED Talk, The Power of Vulnerability. Read More
Every 24 hours, your heart beats approximately 100,000 times. No matter whether it was a day of triumph or defeat, discovery or pursuit, love or loss, you can count on roughly 100,000 reminders that you have embarked, once again, upon another chapter of the best story ever told – yours. In a lifetime, that remarkable force will take 2.5 billion steps toward the future. So, though our hearts may, at times, yearn for the past or a present that never was, their beating drives us ever forward. And that is a journey that must be protected at all costs. Read More
It’s time for us to push the envelope as a data science community. We’ve proven our ability to find the most obscure of facts (e.g., humans share 50% of their DNA with bananas). We’ve uncovered patterns in untamable datasets that lead to ground-breaking insights. We’ve even learned how to predict the future.
Simulation Modeling is a structured approach to discovering key variable relationships within a system. Systems take on many forms across sectors, from agriculture to aerospace and defense to zoology. These systems are generally finite and operate within a set of defined business rules, often forcing decision makers to make difficult tradeoffs that can result in a range of profitable, or costly, outcomes. Read More
Booz Allen Hamilton’s first annual Data Science Bowl attracted more than a thousand teams worldwide to compete in developing the best computer-based visual recognition system. The competition was hosted on the Kaggle platform for three months. Read More
With the next Data Science Bowl just around the corner, I set out to prepare myself for the competition. The truth: I’m not a coder.
I have an interest in data science. I appreciate the process—and the results. I’m open to advice from the best of the best. With that in mind, I set out to find the top tips and tricks buried within last year’s competition forums. What I found is a treasure trove for anyone who is going to participate, from beginner to seasoned pro. Read More
One of the questions I often get from people after giving a talk is “How do I get started with deep neural nets?” Building new deep neural nets is a three-part process: learn the theoretical concepts of the model, play with toy models, and put it into practice by building your own models.
Data science is often regarded as an elite field. It draws experts with advanced degrees in math and science, fluency in multiple programming languages, and a firm grasp of statistics and probability theory. To reach the top of a Kaggle challenge like the Data Science Bowl is to demonstrate a feat of technical wizardry. Read More
Data Science is powerful. By combining the fields of statistics and computer science, it allows us to analyze and understand data and make that data understandable to others. This means Data Scientists can direct the public’s sightlines to particular trends or information. One particular trend in data science, and STEM in general, is worth mentioning: despite the field’s growth, minority groups make up only a small fraction of the STEM workforce. Read More
Modern computing has no shortage of tools for the data scientist. The open source community alters the landscape every six to 12 months, and competition keeps you on the bleeding edge. In my career as a data scientist, I use everything from scientific Python™ packages to the newest cloud computing architectures—and sometimes all within the same project, as the initial stages of data exploration and mining are often done in a different language than the final product implementation. Read More
Overfitting is an issue within machine learning and statistics. It occurs when we build models that closely explain a training data set, but fail to generalize when applied to other data sets. Overfitting is a part of life as a data scientist. We all do it to some degree or another. In the case of forecasting in data science competitions, it might actually be advantageous to overfit to Kaggle’s public leaderboard. However, if you have an independent and identically distributed (iid) split between train and test data sets, then it’s probably better to come up with a leak-free cross-validation (CV) scheme. Read More
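As one illustration of what a leak-free CV scheme can look like, here is a minimal sketch that keeps related samples together. It assumes that the leak to guard against is the same entity (say, one patient or one user) appearing in both the training and validation folds. The `group_k_fold` function and the group labels are hypothetical names made up for this example; real projects may face other kinds of leakage that need different handling.

```python
from collections import defaultdict


def group_k_fold(groups, n_splits):
    """Yield (train_indices, validation_indices) pairs.

    Every sample that shares a group label lands on the same side of the
    split, so no group can leak from training into validation.
    """
    indices_by_group = defaultdict(list)
    for index, group in enumerate(groups):
        indices_by_group[group].append(index)

    unique_groups = sorted(indices_by_group)
    for fold in range(n_splits):
        # Deal the groups out to folds round-robin.
        validation_groups = set(unique_groups[fold::n_splits])
        train, validation = [], []
        for group in unique_groups:
            target = validation if group in validation_groups else train
            target.extend(indices_by_group[group])
        yield train, validation


# Hypothetical group labels: six samples from four entities.
sample_groups = ["a", "a", "b", "c", "c", "d"]
folds = list(group_k_fold(sample_groups, n_splits=2))
```

Each validation fold here contains whole groups only, so a model is always scored on entities it never saw during training. The round-robin assignment is a simplification; it does not balance fold sizes when groups differ a lot in size.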
There has been a lot of news coverage lately around the topic of creating a data-driven culture within an organization. The fact of the matter is a data-driven culture is crippling. We tried to create a data-driven culture too, but ultimately found that our real transformation came by using data as inputs into a real analytics-driven culture: one that values true experimentation, understands failure is the price of discovery, and actually makes use of analytic outputs for decision-making. Read More
Support Vector Machines (SVMs) may not be as popular as Neural Networks within data science, but they are powerful, useful algorithms. One of the difficulties of SVMs has been the computational effort required to train them. However, LIBSVM, which has been used for over a decade, can fairly easily handle the 30,000 training points in the National Data Science Bowl competition’s data set. That makes SVMs a viable tool for you to use both in general, and for the purpose of competing in the Data Science Bowl. Read More
We were incredibly excited to finally meet our data science partners from Booz Allen and Kaggle face-to-face during their site visit on January 15-16, 2015. We had communicated for months prior to their visit, discussing data, analytics, the number of plankton classes for the data science competition, and the general ins and outs of the Data Science Bowl.
Building on Paul Yacci’s earlier post on the importance of feature selection in data science and data analysis, the creation of new features from your existing data set can play a large role in the performance of your model in data science. There are multiple methods of feature creation and data transformation. Often, finding the right transformation of your data can reveal relationships that would be difficult to see otherwise, and may also make it easier for your model to separate classes. Read More
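A small, made-up example may help show how a transformation can make classes easier to separate. The sketch below assumes two classes arranged as concentric rings in the (x, y) plane, a layout that no single straight line can split. Adding one derived feature, the distance from the origin, turns the problem into a simple threshold. The data, the function names, and the threshold of 2.0 are all illustrative assumptions.

```python
import math


def make_ring(radius, n_points):
    """Return n_points evenly spaced (x, y) points on a circle of the given radius."""
    return [
        (radius * math.cos(2 * math.pi * k / n_points),
         radius * math.sin(2 * math.pi * k / n_points))
        for k in range(n_points)
    ]


def radius_feature(point):
    """Derived feature: distance of the point from the origin."""
    x, y = point
    return math.hypot(x, y)


# Two toy classes: an inner ring and an outer ring.
inner_class = make_ring(1.0, 8)
outer_class = make_ring(3.0, 8)

# In the raw x coordinate the classes overlap...
inner_x_range = (min(x for x, _ in inner_class), max(x for x, _ in inner_class))
outer_x_range = (min(x for x, _ in outer_class), max(x for x, _ in outer_class))

# ...but a single threshold on the new feature separates them.
threshold = 2.0
inner_separated = all(radius_feature(p) < threshold for p in inner_class)
outer_separated = all(radius_feature(p) > threshold for p in outer_class)
```

The same idea applies to less tidy data: the raw columns stay as they are, and the new feature simply gives the model a view of the data in which the classes sit apart.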
Data Science was named Forbes Magazine’s sexiest profession of 2014 and the most trending STEM career in Ebony Magazine’s July issue. This has led many to wonder how they, too, can enter the data science profession. So, what is a typical day in the life of a data scientist? Read More
Being a data scientist is more than having a technical background; it’s also about going beyond your tools and understanding what it really means to tackle complex data analysis problems. Whether you are a seasoned big data expert or are just considering moving into the field, here are five things you ought to know. Read More