Garbage in, garbage out: do you need real/true data to extract useful information?

Updated: May 6


The adage is well known in computer science: garbage in, garbage out (GIGO). If the data fed into a program is of poor quality, the program's output will be of poor quality too. The idea may seem obvious, but it needs to be interpreted with nuance.


Data in, information out


In 2015, AlphaGo became the first computer program to beat a professional Go player. To do this, the neural networks that made up its playing algorithm had been trained on examples of high-level Go games. Obviously, had they been trained on chess games, even at the highest level, AlphaGo would never have beaten a Go professional. That would have been a case of GIGO, in a rather trivial sense.


Nevertheless, trained this way on chess games, it would probably have been a formidable opponent on 64 squares... This very simple case shows that the quality of input data depends heavily on the objective, i.e. on the information we want to obtain as output.


Blockchain, data accuracy and supply chain


When we talk about blockchain and supply chain traceability, the question of "data accuracy" quickly comes up. Since data recorded in a blockchain cannot be modified afterwards (see our article Blockchain 101), it is commonly assumed that recorded data must always be true, otherwise (garbage in) all traceability would be wasted (garbage out). Yet this seemingly self-evident claim does not mean much...


In this context, blockchain ensures that the records of who declared what and when have not been modified. In other words, it is the validity of the data that is guaranteed (origin, integrity, timestamp), but in no case its accuracy. In the case of cryptocurrencies the two notions happen to overlap, but that is a different scenario because a cryptocurrency does not exist outside its blockchain network. In the case of supply chain traceability, the data in a blockchain is a digital representation of the physical world: consistency between the two cannot be absolutely guaranteed except in trivial cases.
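The difference between validity and accuracy can be illustrated with a minimal sketch of a hash-chained declaration log. This is not how any particular blockchain is implemented: the function names, the party names, the lot number and the timestamps are all hypothetical, and there are no digital signatures here, so only integrity and ordering are illustrated, not origin.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first record


def record_hash(body: dict) -> str:
    """Hash the declaration together with the hash of its predecessor."""
    payload = json.dumps(body, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def append_declaration(chain: list, who: str, what: str, when: str) -> None:
    """Append 'who declared what and when', linked to the previous record."""
    prev = chain[-1]["hash"] if chain else GENESIS
    body = {"who": who, "what": what, "when": when, "prev": prev}
    chain.append({**body, "hash": record_hash(body)})


def is_valid(chain: list) -> bool:
    """Check that no record was altered or reordered after being written."""
    prev = GENESIS
    for entry in chain:
        body = {key: entry[key] for key in ("who", "what", "when", "prev")}
        if entry["prev"] != prev or entry["hash"] != record_hash(body):
            return False
        prev = entry["hash"]
    return True


chain = []
# Hypothetical declarations; the first one may be false in the physical world.
append_declaration(chain, "warehouse-A", "lot 42 destroyed", "2020-05-06T10:00:00Z")
append_declaration(chain, "carrier-B", "lot 43 shipped", "2020-05-06T11:00:00Z")

# The log is valid whether or not lot 42 was really destroyed.
chain_ok_before = is_valid(chain)

# Rewriting history after the fact is what the chain detects.
chain[0]["what"] = "lot 42 stored"
chain_ok_after = is_valid(chain)
```

The check passes for the original log regardless of what actually happened in the warehouse, and fails only once a record is rewritten: the structure protects the declaration, not the fact declared.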


Indeed, supply chain traceability faces several complex challenges:


· the enormous fragmentation of modern industrial chains: up to several dozen intermediaries, with little integration and little knowledge of one another, even between intermediaries only two or three tiers apart;


· the volume of physical flows and data that requires processing to scale;


· the heterogeneity of data, both in how it is collected (ERP, IoT, Excel, paper, manual input, SSCC scanning, etc.) and in its life cycle;


· errors, breakdowns and fraud, which are an operational reality that any pragmatic observer will acknowledge.


Given these constraints, it is illusory to try to guarantee the accuracy of all the data we collect. If an IoT sensor malfunctions, it can send false data; if there is an input error in the ERP, the data will be false; if a supply chain partner cheats, they will provide false data, etc.


In reality, once data has been collected, we need to know how to bring it to life and make it speak, whether it is true or not. The knowledge we can draw from it depends on how we decide to look at it, and therefore on the type of information we seek. Slightly changing the reading angle is sometimes enough to deduce new, usable information from data "that does not speak".


I doubt therefore I analyse...


Data validity, on the other hand, makes it possible to define a mechanism for analysis, understanding and continuous improvement of the supply chain. Indeed, valid data make it possible to build a stable image of the process, which can then be analysed and improved.


Take for example the destruction of a recalled product in a warehouse. Blockchain guarantees that the destruction has been declared but does not guarantee the destruction itself. What, then, is the quality of the data describing the declaration of destruction? If the destruction did not take place but the declaration claims that it did, is this data of poor quality (garbage in)?


In practice, different algorithms are used to build confidence indices on the data: comparing independent data sources to check their consistency, applying an expected model to each source (for example a statistical model of its behaviour over time, or a specification), and identifying "outliers" in the data. Depending on the confidence index, potential anomalies are identified and qualified: they may relate to a data source, to the evaluation of the confidence index itself, or to the destruction process. By correcting each anomaly, we improve either the process or its control. This approach can be seen as a pragmatic process of continuous improvement in the supply chain.
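As an illustration only, a confidence index of this kind could be sketched as follows. The source names, the 5% tolerance, the halving penalty and the sample figures are all assumptions made for the example, not a description of a production system. The outlier test uses the modified z-score based on the median absolute deviation, with its conventional 0.6745 constant and 3.5 threshold.

```python
from statistics import median


def consistency(values, tolerance=0.05):
    """Fraction of sources that agree with the median within a relative tolerance."""
    ref = median(values)
    if ref == 0:
        return sum(v == 0 for v in values) / len(values)
    return sum(abs(v - ref) <= tolerance * abs(ref) for v in values) / len(values)


def is_outlier(value, history, threshold=3.5):
    """Modified z-score test using the median absolute deviation (MAD)."""
    med = median(history)
    mad = median(abs(x - med) for x in history)
    if mad == 0:
        return value != med
    return abs(0.6745 * (value - med) / mad) > threshold


def confidence_index(source_values, duration, past_durations):
    """Combine cross-source consistency with a check against expected timing."""
    score = consistency(list(source_values.values()))
    if is_outlier(duration, past_durations):
        score *= 0.5  # arbitrary penalty, chosen only for this sketch
    return score


# Hypothetical destruction of a recalled lot: quantity (kg) seen by three sources.
sources = {"erp_declaration": 100.0, "scale_sensor": 98.0, "waste_receipt": 61.0}
# Hypothetical durations (minutes) of past destructions of similar lots.
past_durations = [42, 45, 40, 44, 43, 41]

index = confidence_index(sources, duration=5, past_durations=past_durations)
needs_review = index < 0.8  # hypothetical threshold for flagging an anomaly
```

A low index does not say that the destruction did not happen. It only flags the declaration for review, which may then point to a faulty source, to a badly tuned index, or to the process itself.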


The validity of the data, and the fact that they cannot be modified, is a cornerstone of this approach. If the validity of the data could be compromised, no improvement would be possible since any analysis could be called into question following a modification of the basic data: foundations are not built on quicksand.


Take a step aside


Garbage in, garbage out implies a transformation of data into information. The quality of this transformation can be judged against the information we want to obtain: the "in" and the "out" are neither absolute nor independent, and everything depends on the fit between the input data and the information sought at the output. Yet because of this adage, there is a tendency to pile promises onto input data and their "accuracy" without adapting the information sought. This is particularly true of many blockchain approaches. It is also possible to take a step aside, adopt a different perspective, and ask what information can be extracted from a data set.