How Useful is Your Data?
George Box famously remarked, “All models are wrong, but some are useful,” a testament to the complexity of using data and statistical models for prediction. The idea is embraced as doctrine by applied statisticians yet receives far less attention outside the statistics community. The quote speaks to a challenge everyone faces during the improvement phase of a project: does the model generated from the data represent the true situation? It is also a cautionary tale about the vigilance required to ensure enough of the right data are collected and analyzed correctly, and it underscores the importance of validating a statistical result to confirm it reflects what was actually intended.
For people who consider statistical analysis tomfoolery, the quote might be read differently: “All models are wrong because data can be manipulated to support a predetermined conclusion.” Statistics provides an estimate based on available information and data analysis. Statistical models are therefore approximations built from the information gathered, with some models more adequate for their intended purpose than others. If a model fits well, the predicted value is a good estimator of the observed value. Developing useful models requires foresight, planning, and careful attention to the details of the data.
Impact of Data Systems and Software
Data are now readily available thanks to software advances that allow process inputs and outputs to be collected. At times, human intervention is necessary to assess an output and determine whether an input change is required. In a closed-loop system, outputs are measured continuously and fed back to the input. Designed to achieve and maintain a desired state or set point without human intervention, these systems are becoming more prevalent for controlling production processes. In a closed-loop system, feedback occurs whenever the calculated difference between the desired and actual values indicates an error, prompting a compensating command based on algorithms and artificial intelligence. It is important to remember that data captured in either an open- or closed-loop system are (or are meant to be) taken in the same manner every time. The data sample or signal must also represent the larger population being profiled. Data usefulness depends on this fundamental requirement as well as on the data’s ability to explain the problem.
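As a rough illustration, the sketch below shows the error-and-compensate logic described above in a few lines of Python. The function name, gain, and set point are hypothetical; real control systems rely on vendor-specific algorithms and tuning.

```python
# Minimal sketch of the closed-loop logic described above (hypothetical
# names and a simple proportional correction; actual controllers use
# vendor-specific algorithms and tuning).

def feedback_step(set_point: float, measured: float, gain: float = 0.5) -> float:
    """Return a compensating adjustment based on the error signal."""
    error = set_point - measured          # desired minus actual
    return gain * error                   # command fed back to the input

# Example: a process running below its set point receives a positive correction.
adjustment = feedback_step(set_point=100.0, measured=97.5)
print(f"Compensating command: {adjustment:+.2f}")
```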
How important are data? It is generally agreed that the lack of proper data is a major hindrance to producing sound statistical conclusions. The combination of point-and-click statistical software and copious amounts of historical data may be useful in certain situations; it may also lead to data manipulation or analysis results that do not help solve the problem. For this reason, it is every improver’s responsibility to understand the source of, and facts behind, all historical data, and to assess whether the data represent the information needed to develop sound conclusions. Improvers must also take responsibility for collecting data that are needed for problem solving but not readily available.
What causes smart people to overlook data fundamentals? Experience suggests it is done unknowingly. In today’s rush for immediate answers, the combination of large stores of historical data and drop-down statistical software menus invites data manipulation, data torture, and poor conclusions. Statistical programs make analysis easy, with help screens that guide the setup, running, and interpretation of a test. The statistical theory behind the tests and the assumptions they require are far less evident, so those with little or no formal statistics training do not see the danger. (A notable example is the widespread misuse of capability indices such as Cp, Cpm, and Cpk, which should never be reported unless all underlying assumptions are met; a small illustration follows.)
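As a hedged sketch of what checking those assumptions can look like in practice, the Python below verifies approximate normality before reporting Cpk. The data, specification limits, and the 0.05 cutoff are all made up for the example, and a real assessment would also confirm the process is stable (in statistical control).

```python
# Illustrative sketch only: compute Cpk, but first check one key assumption
# (approximate normality) rather than trusting the index blindly.
# Data and specification limits below are fabricated for the example.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)
data = rng.normal(loc=10.0, scale=0.2, size=100)   # stand-in process data
lsl, usl = 9.4, 10.6                               # hypothetical spec limits

stat, p_value = stats.shapiro(data)                # normality check
if p_value < 0.05:
    print("Data depart from normality; a standard Cpk may be misleading.")
else:
    mean, sigma = data.mean(), data.std(ddof=1)
    cpk = min(usl - mean, mean - lsl) / (3 * sigma)
    print(f"Cpk = {cpk:.2f}")
```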
How to Minimize Data Risks
Is data analysis the first step? There is a fundamental point the improver must understand to be successful: data analysis and conclusions come at the end of a sequence of steps. Below are considerations to address before analysis and interpretation so that your results represent a true approximation of the situation being studied:
- Understand the Issue and the Intended Outcome: Too many times the phenomenon fondly called “Ready, Shoot, Aim” occurs because of the urgency to begin improvement work before fully understanding the complexities of the issue or the desired outcome. In many instances, initial discussions oversimplify the problem and understate the desired end state. That lack of understanding distorts the big picture, the perceived needs, and the data layout requirements.
- Know the Process: An improver must understand the process and be able to articulate its specifics, such as layout, steps, key inputs, key outputs, and what variables are collected and how. Key stakeholders should be interviewed and involved to minimize the risk of improver bias and tunnel vision.
- Formulate the Answers Needed from the Data: Data provide detail and answers to questions about the variables that impact a process, adding clarity to each puzzle piece required to put the improvement picture together. Investigation may surface variables that have never been tracked or studied, or ones that need to be studied in a new way.
- Determine the Best Way to Use Historical Data: The usefulness of historical data depends on the collection plan, frequency, method, and the culture surrounding data collection. If historical data are valid but lack granularity or are not the right type, they may still help characterize variation and provide insight and direction for the process and for future data needs.
- Develop Studies and Plans: Consider the big picture to avoid missing opportunities to learn how variables interact. Studies should be well planned and comprehensive enough to capture variables that are not currently collected but that experience suggests are important to the process or may affect other variables.
- Be Collection Savvy: Many things can happen during studies and data collection, even with carefully laid-out designs. When deviations from the sampling and design plans occur, it is important to understand how the changes may affect data analysis and interpretation. Knowledge of the signals and algorithms in control systems is also important.
- Have an Unbiased Approach to Analysis: An entire post could be written on the ways improvers unknowingly compromise the integrity of an analysis. Software will always provide an answer, but does that answer reflect reality?
- Follow the principles of good statistical analysis. For example: plot the data to look for sources of variation (common, assignable, and structural); do not over-model the data by removing too many points; transform data only when appropriate; and graph the data rather than basing decisions solely on a test’s p-value (a brief illustration follows this list).
- Refrain from the mentality of “I know the answer and just have to prove it with the data.” Let the data drive the solutions. Beginning an analysis with a predetermined notion of the outcome results in overlooking anything that does not support your bias and highlighting anything that does. Let the data analysis speak for itself.
- Avoid running every possible graph or analysis simply because the software can generate them. More is not better; each graph and test has a specific purpose, and the tail should not wag the dog. Know the purpose of the analysis and run the appropriate tests and graphs. The results may warrant additional analysis, but the approach is then systematic.
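To make the “graph the data, don’t decide on the p-value alone” point concrete, here is a small Python sketch with fabricated data: a two-sample t-test whose result is reviewed alongside a run-order plot so that common, assignable, and structural variation can be seen before the number is trusted.

```python
# Hedged illustration: compare two samples with a t-test, then plot the
# data in run order instead of relying on the p-value alone.
# Data and group labels are fabricated for the example.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(seed=7)
before = rng.normal(loc=50.0, scale=2.0, size=40)   # baseline sample
after = rng.normal(loc=51.0, scale=2.0, size=40)    # post-change sample

t_stat, p_value = stats.ttest_ind(before, after)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")

# Look at the data in observation order for shifts, trends, or outliers
# before drawing a conclusion from the test statistic.
fig, ax = plt.subplots()
ax.plot(before, marker="o", label="before change")
ax.plot(after, marker="s", label="after change")
ax.set_xlabel("observation order")
ax.set_ylabel("measured value")
ax.legend()
plt.show()
```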
If you want assistance in evaluating your data’s effectiveness to meet your business goals, feel free to contact me.