Avoiding murky data science projects

Paying for data science projects can be stressful.

The dream is a straightforward application of existing, high-power statistical and machine learning technologies to relevant, existing data. Flip a switch and out pop tools for better decision making or new products for your customers. The reality is that none of that is straightforward and, in the worst cases, you end up in research hell.

A big part of data science is learning and exploration. You may not know what you don’t know; you may not know what opportunities exist in the data you have access to (or could easily get access to). So, when you charter a data science team to solve a business problem, you may be setting off on a long, murky journey.

Research hell is when these projects struggle to deliver but stay tantalizing. You invest and invest and wait and wait and the project trudges on.

And on. And on. And on… Misery.

Now, instead of jumping on whole new opportunities born of your investment in data, you’re nursing a murky plan and managing a distressed, disconnected long-term research team. Or, worse, you’re left judging a science fair.

Clear quantitative questions don’t come for free

What you need is a super well-defined quantitative problem to send your super well-defined quantitative team after, right?

This, of course, works and is the setting of many data science success stories. Marketing teams have been taking advantage of A/B testing for years to demystify the murky world of customer preference. Quality control in manufacturing has whole textbooks written on the specific, quantitative needs of estimating and mitigating defect rates. And don’t even get started on quantitative finance, which is rife with clear quantitative needs.
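For a flavor of what commoditized data science looks like, here is a minimal sketch of that first example: comparing two conversion rates from an A/B test with a pooled two-proportion z-test. The counts are invented for illustration.

    # A pooled two-proportion z-test for an A/B experiment.
    # The conversion counts below are invented for illustration.
    from math import sqrt
    from statistics import NormalDist

    def ab_test(conv_a, n_a, conv_b, n_b):
        """Two-sided z-test for a difference in conversion rates."""
        p_a, p_b = conv_a / n_a, conv_b / n_b
        pooled = (conv_a + conv_b) / (n_a + n_b)  # shared rate under H0
        se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
        z = (p_b - p_a) / se
        p_value = 2 * (1 - NormalDist().cdf(abs(z)))
        return p_b - p_a, z, p_value

    lift, z, p = ab_test(conv_a=120, n_a=2400, conv_b=151, n_b=2380)
    print(f"lift={lift:.4f}  z={z:.2f}  p={p:.3f}")

Decades of statistical practice stand behind those few lines; that is what "commoditized" buys you.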

The common thread is that all of these are commoditized data science, the result of sometimes decades of market evolution and research. These clear quantitative questions didn’t come for free.

Most business and product questions haven’t already been commoditized—and, to be clear, that’s part of what makes them valuable! But it’s also precisely why you can’t expect to consistently come up with great, well-scoped data science questions.

Insight is a process

The truth of the matter is that learning and insight form a process, and for the questions you have, you may be at the very beginning of that process. The field of informatics likes to discuss the DIKW Pyramid of

  • Data: raw signals, numbers, observations, facts,
  • Information: data that has been structured, organized, compared, interrogated,
  • Knowledge: information that is framed and understood, contextualized and leveraged, reduced to practice, and
  • Wisdom: integrated, generalized, tempered knowledge; perhaps elusive.

For my money, DIKW is more interesting as a story. You can get into arguments about where one element ends and the next begins, or you can run with the main message: that achieving insight is a process.

The various data and data processes you own evolve and become refined through efforts like exploration, curation, and organization. As these outputs—data products, really—"climb the pyramid," it becomes easier to grasp the crisp quantitative impact they can have.
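As a toy illustration of that climb, consider inspection events rolling up into a defect-rate summary. The records below are invented; the shape of the progression is the point.

    # A toy "climb the pyramid" with invented inspection records.
    from collections import Counter

    # Data: raw signals, one record per unit inspected.
    events = [
        {"line": "A", "ok": True}, {"line": "A", "ok": False},
        {"line": "A", "ok": True}, {"line": "B", "ok": True},
        {"line": "B", "ok": True}, {"line": "B", "ok": True},
    ]

    # Information: the same records, organized and compared.
    totals = Counter(e["line"] for e in events)
    defects = Counter(e["line"] for e in events if not e["ok"])

    # Knowledge: a contextualized summary someone can act on.
    for line in sorted(totals):
        rate = defects[line] / totals[line]
        print(f"line {line}: defect rate {rate:.0%} over {totals[line]} units")

Each step is a small data product in its own right, and each is easier to point at a business question than the one before it.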

Scoping high-value data science projects

With something like the DIKW Pyramid in hand, we can take a different view of data science projects. There are many kinds, and they live along a continuum.

On the "highly developed" end we see those commoditized problems like defect rate estimation or recommender systems which, if they apply to your needs, are crisp and efficient. On the "highly unexplored" end we see data gathering, rough cuts, and the call of greenfield projects that could change everything.

If you try to set endpoints and objectives for a project without understanding where it sits on this development continuum, you’re likely aiming for disaster. The objectives of a team working with raw data need to be fast, agile, and rudimentary. The objectives of a team working with highly prepared data and working knowledge need to be heavily integrated, optimized, and effective.

And even if your real goal is to take something from the bottom of the pyramid to the top, by making space in the project for this development, you can help the project breathe. Instead of radio silence for a year, the team should come up for air, deliver new learnings and value to important stakeholders, and reorient its approach.

To use a metaphor from product design, we should build prototypes before high-volume production runs. Prototypes are key because they de-risk bigger investments cheaply while also laying down the scaffolding for that future work. Rushing to the end of this process by skipping steps is a known recipe for failure.

Software organizations know that "pioneering" work is different from "production" work. The goals are different, the timescales change, and the management processes, sometimes even the ideal personalities, may all need to shift.

Putting this into practice

The first step to putting this into practice is to take stock of what you have and where you’re trying to go.

Given the way software is eating the world nowadays, you likely have valuable data all around your org. If you’re building software, your teams might already have mechanisms for collecting data, or could build them. If you’re not, then it’s likely that your software vendors are collecting it on your behalf.

Or perhaps you’re already way ahead and have been developing or buying more advanced products on your own.

Take a few minutes to inventory these nascent and growing data products. List each one, its relationship to the goals of your team or organization, and your estimate of its level of development. Then think about the goals and outcomes of current efforts: are there ways to get to a win sooner by considering where each product sits along the development process?
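One hypothetical way to start that inventory is a plain list with a rough development stage attached. The entries and the 1-to-5 scale below are invented placeholders, not a prescribed rubric.

    # A hypothetical stock-taking list; the entries and the 1 (raw data)
    # to 5 (commoditized product) scale are invented placeholders.
    inventory = [
        {"product": "web analytics events", "goal": "funnel insight",  "stage": 1},
        {"product": "weekly sales rollup",  "goal": "demand planning", "stage": 3},
        {"product": "churn model v0",       "goal": "retention",       "stage": 2},
    ]

    # Surface the most developed products first: likely the nearest wins.
    for item in sorted(inventory, key=lambda i: -i["stage"]):
        print(f'{item["product"]:<22} stage {item["stage"]}  -> {item["goal"]}')

Even a list this crude makes the continuum visible, and that is usually enough to start the scoping conversation.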

Further Reading

  • Checklists for data product staging lists questions useful for evaluating the level of development of a given data product. It can be a starting point for sizing up the projects you have or want to begin. It can also inspire new directions for further refinement of products you already have.