Amount of data versus Information

P. Schmälzle and F. Herrmann; Abteilung Didaktik der Physik, Universität Karlsruhe, 76128 Karlsruhe, Germany

Abstract
It is argued that the quantity introduced by C. Shannon should be called "amount of data" rather than "information".


C. Shannon [1] defined the quantity

H = - Σ pi log pi        (1)

(the sum running over the probabilities p1, ..., pn of the possible messages) and called it "entropy of an information source" or "entropy of the probabilities p1, ..., pn". It soon became customary, however, to call this quantity "information" [2], [3], [4]. We will try to show in this note that the word "information" is inappropriate, and we argue in favour of the name "amount of data".
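As an illustration of equation (1), the quantity H can be evaluated for any probability distribution. The following short Python sketch (a minimal illustrative example, using base-2 logarithms so that H is measured in bits) also shows that H depends only on the probabilities pi, not on the content of the messages:

from math import log2

def amount_of_data(probabilities):
    # Shannon's quantity H of equation (1); base-2 logarithm, so the result is in bits
    return -sum(p * log2(p) for p in probabilities if p > 0)

# Two sources with entirely different messages but identical probabilities:
# equation (1) sees only the distribution {pi}, so H is the same for both.
weather = {"sun": 0.5, "rain": 0.25, "snow": 0.25}
letters = {"A": 0.5, "B": 0.25, "C": 0.25}

print(amount_of_data(weather.values()))   # 1.5 bits
print(amount_of_data(letters.values()))   # 1.5 bits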
When a name has to be given to a physical quantity, often a word from colloquial, i.e. non-scientific, language is chosen. This means that the meaning of the word is at once restricted and defined more precisely. Choosing a word from common language to denote a physical quantity can help the learner acquire an intuitive idea of the quantity. An improperly chosen word, however, can mislead the learner's intuition. The word "information" for the quantity H is an example of the latter case.
This can be seen from the following sentence, which is typical of the use of the word "information" in common language: "He gave me an important piece of information about an affair." The sentence shows that the common-language use and the scientific use of the word "information" differ on several points.
1) In this sentence, "information" denotes what communication theory calls a (single) message. The quantity H of equation (1), however, is not a message but a measure of an amount of messages, just as mass is not matter but a measure of an amount of matter.
2) The adjective "important" in the above sentence shows that the content of the message is evaluated according to a human criterion; the same message could be judged "unimportant" by another person. Communication theory, however, does not take any human evaluation of the transmitted data into account.
3) The sentence cited above shows that one always gets information about something, in our example about an affair, or, expressed more formally: system A transmits information to system B about a system C. This way of speaking causes difficulties if one asks to which system the quantity H is to be attributed. Shannon's quantity H is clearly attributed to one physical system, e.g. a data source, and its value is determined by the probability distribution {pi}. It also makes sense to speak about a flow of the quantity H from one system A to another system B. There is, however, no need to mention a further system C to which the information might refer.
We have observed that these discrepancies lead to wrong conclusions not only among beginners but also among people experienced in thermodynamics and communication theory.
We therefore propose a name for the quantity H which avoids the difficulties mentioned above: "amount of data". The name is formed in analogy to the well-established name "amount of substance". The word "amount" indicates that not a single message is meant, but a measure of an amount of messages. The word "data" suggests that the meaning of the messages is left out of consideration. For the same reason, by the way, Schopenhauer [5] designated the signals transmitted between the sensory organs and the brain by the Latin word "data". Speaking about data is thus speaking about single messages without taking any human evaluation into account.
Using the word "data" in our context has another advantage: many word combinations beginning with "data" are in common usage today, e.g. data processing, data storage, data transmission, data line, etc. Calling the quantity H "amount of data" is thus a natural extension of these word combinations.
One might propose, however, to call the quantity H entropy, as Shannon originally did. A further argument in favour of this name is that the entropy of a thermodynamic system can be interpreted as "information" in the sense of Shannon's definition [6], [7]. The "entropy" in which the communications engineer is interested, however, is simply an additive term in the total "physical" entropy of a system, and in general this term is very small compared to the total entropy. Since only this (small) term matters for communication science, we think it is justified to give it a name of its own.
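A rough order-of-magnitude estimate makes the remark in parentheses concrete. The following sketch assumes the usual conversion of one bit into k·ln 2 of thermodynamic entropy; the terabyte and the comparison value are illustrative choices only:

from math import log

k_B = 1.380649e-23            # Boltzmann constant in J/K
one_bit = k_B * log(2)        # entropy equivalent of one bit, about 1e-23 J/K

data_term = 8e12 * one_bit    # roughly one terabyte of data, expressed in J/K
print(f"one bit:    {one_bit:.1e} J/K")     # ~ 9.6e-24 J/K
print(f"1 terabyte: {data_term:.1e} J/K")   # ~ 7.7e-11 J/K

# For comparison, the standard entropy of one mole of liquid water is of the
# order of 70 J/K, i.e. larger by roughly twelve orders of magnitude.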

References

[1] C. Shannon, The mathematical theory of communication. Urbana: University of Illinois Press, 1949.
[2] M. Tribus, Thermostatics and thermodynamics. Princeton: D. van Nostrand Company, 1961.
[3] G. Raisbeck, Information theory. Cambridge, Massachusetts: MIT Press, 1963.
[4] H. Haken, Synergetics. Berlin: Springer Verlag, 1977.
[5] A. Schopenhauer, Die Welt als Wille und Vorstellung. Zürich: Diogenes, 1977, Vol. II, pp. 33-37.
[6] E. T. Jaynes, Information Theory and Statistical Mechanics. Phys. Rev. 106, 620-630, 1957.
[7] F. Herrmann, An Analogy between Information and Energy. Eur. J. Phys. 7, 174-176, 1986.