Data Doesn’t Have to Hurt

Increasingly, data is seen as the holy grail of a modern business. Data evangelism has swept the media. An article posted in Wired suggested that the ability to store and analyse petabytes of data has made causal analysis obsolete. Using Google, and all of its expensive analytical and mathematical tools, as an example, the article states that companies can “view data mathematically first, [and] establish context later.” As if all companies had pockets as deep as Google!

Articles like this give the sense that the more your company invests in data, the further ahead of the competition you will be, and that data is inherently wonderful and just having it will bring success. On this view, you can be certain all the data you have invested in will benefit the company, so long as you also invest in pricey analytical tools to handle it.

As a data professional, I find these statements deeply concerning. Data is rarely free. It often has an opportunity (capital or labour) cost associated with its collection; it is an investment and, as many have experienced, not all investments work out. If your company cannot afford to spend Google-like amounts on new technology and data storage, that doesn’t mean a data-driven approach is unavailable to you. With careful planning and execution, you can increase the chance your data investment will pay off while keeping the costs reasonable.

Let’s start by asking, what is data? Data is defined as a collection of facts, observations, and statistics collected for reference or analysis. In this definition, we can see the first problem with data worship — data utility requires further reflection. Data that isn’t processed and analysed is worthless.

So, when does data have worth? When we get information from it. Information is the result of processing, organising, and analysing data so it can be used for decision making. It is created by taking data and turning it into something useful and of value. So, fundamentally, we don’t care about data; we care about information. For an extraction business, the questions that matter look like this: What is my extraction efficiency, what is my average extraction rate per hour, and how does my extraction rate vary over time? Considering operating costs, should I extract to 75% yield? 90%? 99%? These questions are answered by information, not data.
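As a rough illustration of the step from data to information, the sketch below turns a small run log into two of the figures mentioned above: the extraction rate for each run and the average rate per hour. The record layout, field names, and numbers are all hypothetical, chosen only to show the idea.

```python
# Hypothetical run log: every field name and number here is illustrative.
runs = [
    {"run": 1, "extract_kg": 1.8, "hours": 4.0},
    {"run": 2, "extract_kg": 2.1, "hours": 4.5},
    {"run": 3, "extract_kg": 1.5, "hours": 3.0},
]


def rate_per_run(log):
    """Extraction rate (kg of extract per hour) for each run, in log order."""
    return [r["extract_kg"] / r["hours"] for r in log]


def average_rate(log):
    """Overall rate: total extract divided by total run time."""
    total_kg = sum(r["extract_kg"] for r in log)
    total_hours = sum(r["hours"] for r in log)
    return total_kg / total_hours


if __name__ == "__main__":
    print("Rate per run (kg/h):", [round(x, 3) for x in rate_per_run(runs)])
    print("Average rate (kg/h):", round(average_rate(runs), 3))
```

The raw log is data; the per-run rates and the overall average are information that can inform a decision. Here the average is taken as total extract over total hours, so longer runs carry more weight; averaging the per-run rates instead would be an equally valid choice, depending on the question being asked.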

Yet, for all we talk about information being the thing we want, it ultimately comes from the data. During my training, it was hammered into me that there is no such thing as “bad” data. Data is neither good nor bad, but it can be useful or not. This is the crux of the matter. Useful data can be turned into insight; useless data is an opportunity loss. On average, companies analyse only 12% of their data. That leaves 88% of the data they collected and invested in unused, which is a huge opportunity cost.

If you want to maximise your investment, here are some questions you should be asking of your data processes:

  1. Why are you collecting data? What do you want out of it, and what questions are you trying to answer? Unless you have a reason to collect data and you are going to use that data, don’t bother collecting it. Put your resources to better use: send an employee on a statistics course or have a company day out. Just about anything is better than collecting data and not using it.
  2. Once you have a question, which data do you need to answer it? Assess which data streams you need, and what additional data streams might come for free. Free data is good, and you should certainly record it. However, make sure your focus is on the data you need and be prepared to spend to get it. An example might be that you are tracking feed material into a process, and an employee records the lot number and the weight. If this is entered electronically, you can capture the time that data was entered, and from there, you can capture how many batches were run that day. These are two examples of free data streams; they don’t cost any more in employee time or resources.
  3. Are you letting the data collection process change the original process? When you are collecting data, do it well and be consistent, but don’t let data collection change the original process. If there is a human component to the data collection, the observer effect can come into play. People’s behaviour often changes when they think they are under scrutiny. They may do things differently, or misreport how a task was done, because it is being recorded. Ask whether there is a mismatch between your standard operating procedures and actual normal operation. If there is, the data you are collecting does not accurately reflect normal operations. Try to use electronic platforms and automate the data collection; if that’s not possible, try to focus on the processes, not the operators.
  4. Once you have analysed the results and answered your question, examine your data again. Pay particular attention to the free data: does it prompt any new questions? If it does, go back to question two. For example, using the free time stamps, does the time of day affect the output weight of the process? Does the number of batches affect the outcomes? If you find an effect, why is it happening? Is it beneficial? If so, how can you bring the rest of the process up to the new standard? What other data do you need to collect to answer these new questions?
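The free data streams described in questions two and four can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the log layout, field names, lot numbers, and weights are hypothetical, and the only "free" field is the timestamp captured when the operator saves an entry.

```python
from collections import Counter, defaultdict
from datetime import datetime
from statistics import mean

# Hypothetical feed log. The operator types in the lot number and weight;
# the timestamp is captured automatically when the entry is saved.
feed_log = [
    {"lot": "A-101", "weight_kg": 12.0, "entered_at": datetime(2024, 3, 4, 8, 15)},
    {"lot": "A-102", "weight_kg": 11.5, "entered_at": datetime(2024, 3, 4, 13, 40)},
    {"lot": "A-103", "weight_kg": 12.5, "entered_at": datetime(2024, 3, 5, 9, 5)},
    {"lot": "A-104", "weight_kg": 10.0, "entered_at": datetime(2024, 3, 5, 14, 20)},
    {"lot": "A-105", "weight_kg": 11.0, "entered_at": datetime(2024, 3, 5, 16, 45)},
]


def batches_per_day(log):
    """Count entries per calendar day using only the free timestamp."""
    return Counter(entry["entered_at"].date() for entry in log)


def mean_weight_by_time_of_day(log, split_hour=12):
    """Average recorded weight for entries made before and after split_hour."""
    groups = defaultdict(list)
    for entry in log:
        label = "morning" if entry["entered_at"].hour < split_hour else "afternoon"
        groups[label].append(entry["weight_kg"])
    return {label: mean(weights) for label, weights in groups.items()}


if __name__ == "__main__":
    for day, count in sorted(batches_per_day(feed_log).items()):
        print(day, count, "batches")
    print(mean_weight_by_time_of_day(feed_log))
```

The same grouping would apply to output weights if those are recorded alongside the feed data. A difference between the morning and afternoon averages is a prompt for a new question rather than an answer in itself; it tells you which additional data stream might be worth paying for.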

Some of this may sound a little negative, as though I am against data; this is not the case. Data does hold the potential for substantial business improvements, but it needs to be the right data and it needs to be used.

For example, my colleagues and I have worked with producers who needed help optimising their extraction process and who had historically kept detailed logs of the set conditions for their extractors. They proudly gave us all their data, which we had to discard in its entirety because there was no record of the actual conditions obtained, or of the chemical analyses of the input and output material. All we were given was a log of the mass that went in, their intent, and the mass that was obtained.

It is possible that this data was suitable for some original purpose, perhaps ensuring correct conditions were being chosen for each run, or record keeping for inventory management. However, for process optimisation, this data was woefully inadequate. This is where question two is important. You should collect only the data you really need and be prepared to gather new data streams as your questions change. If you are collecting data with the intent to answer every possible question, you are going to needlessly increase your operational costs. In all likelihood, you will collect all that you can think of, then as you develop new questions, you will realise you need more data streams anyway. If you somehow manage to collect enough data to answer every possible question that you will ever have, then you should make data generation your entire business and forget about your product.

Overall, I am advocating for companies to maximise their data utility in a way that minimises their opportunity cost. Collecting more focused data streams means spending less time and money on collecting data, which in turn will allow you to spend those savings in the future on adding new, focused data streams as your questions evolve. This approach to data collection allows you to ensure you are getting the information you need to make important business decisions and are not falling behind your competitors, but in a way that minimises the cost of collecting and storing useless data.

About the author

Tom Dupree Ph.D., Delic Labs