An algebraic formalization of distributed processing of big data is considered. The concept of an information space is defined for a given data processing procedure, and a criterion for its minimality is established. A minimal information space is proved to exist; it provides the most compact representation of the information contained in the data and allows the most efficient parallelization of data processing. An element of this space consistently describes the information contained in the corresponding data set. It is shown that the concepts of information addition and information quality are naturally expressed in terms of the information space, in agreement with the intuitive notion of information. The advantages of using the minimal information space in the MapReduce distributed data processing model are also considered. In this model, Map transforms the original data sets into elements of the information space, and Reduce combines these pieces of partial information into a single element representing all the original data. By way of illustration, several examples of data processing procedures are analyzed and the corresponding minimal information spaces are presented.
07.05.Kf Data analysis: algorithms and implementation; data management
$^1$Department of Mathematics, Faculty of Physics, M. V. Lomonosov Moscow State University
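The Map and Reduce roles described in the abstract can be illustrated with a minimal sketch. It assumes that the processing procedure is the arithmetic mean and that an information element is the pair (count, total). This choice, and the names `MeanInfo`, `map_chunk`, and `reduce_infos`, are hypothetical illustrations and are not taken from the paper.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Iterable, Sequence


@dataclass(frozen=True)
class MeanInfo:
    """Hypothetical information element for the arithmetic mean."""
    count: int = 0      # number of observations summarized
    total: float = 0.0  # sum of the observations

    def __add__(self, other: "MeanInfo") -> "MeanInfo":
        # Information addition: componentwise, hence associative and
        # commutative, so partial results may be merged in any order.
        return MeanInfo(self.count + other.count, self.total + other.total)

    def result(self) -> float:
        # Recover the final processing result from the accumulated information.
        if self.count == 0:
            raise ValueError("no data accumulated")
        return self.total / self.count


def map_chunk(chunk: Sequence[float]) -> MeanInfo:
    """Map: turn one raw data set into an information element."""
    return MeanInfo(len(chunk), float(sum(chunk)))


def reduce_infos(infos: Iterable[MeanInfo]) -> MeanInfo:
    """Reduce: combine partial information into a single element."""
    # MeanInfo() is the neutral element (the information in an empty data set).
    return reduce(lambda a, b: a + b, infos, MeanInfo())


chunks = [[1.0, 2.0, 3.0], [4.0], [5.0, 6.0]]
combined = reduce_infos(map_chunk(c) for c in chunks)
print(combined.result())  # 3.5
```

Under these assumptions each chunk is reduced to two numbers regardless of its size, and the combined element yields the same mean as processing all the data at once.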