Lossless Compression of Data Tables in Mobile Devices by Using Co-clustering
Keywords:
data tables, lossless compression, co-clustering, redundancyAbstract
Data tables have been widely used for storage of a collection of related records in a structured format in many mobile applications. The lossless compression of data tables not only brings benefits for storage, but also reduces network transmission latencies and energy costs in batteries. In this paper, we propose a novel lossless compression approach by combining co-clustering and information coding theory. It reorders table columns and rows simultaneously for shaping homogeneous blocks and further optimizes alignment within a block to expose redundancy, such that standard lossless encoders can significantly improve compression ratios. We tested the approach on a synthetic dataset and ten UCI real-life datasets by using a standard compressor 7Z. The extensive experimental results suggest that compared with the direct table compression without co-clustering and within-block alignment, our approach can boost compression rates at least 21% and up to 68%. The results also show that the compression time cost of the co-clustering approach is linearly proportional to a data table size. In addition, since the inverse transform of co-clustering is just exchange of rows and columns according to recorded indexes, the decompression procedure runs very fast and the decompression time cost is similar to the counterpart without using co-clustering. Thereby, our approach is suitable for lossless compression of data tables in mobile devices with constrained resources.
References
Rao S., Bhat P. (2015); Evaluation of lossless compression techniques, IEEE International Conference on Communications and Signal Processing (ICCSP), 1655-1659. http://dx.doi.org/10.1109/iccsp.2015.7322799
Kontoyiannis I., Verdu S. (2014); Optimal lossless data compression: Non-asymptotics and asymptotics, IEEE Transactions on Information Theory, 60(2): 777-795. http://dx.doi.org/10.1109/TIT.2013.2291007
D. Salomon (2007); Data Compression: The Complete Reference, 4th ed. New York: Springer, 2007.
K. Sayood (2000); Introduction to Data Compression, 2nd ed. San Francisco, CA: Morgan Kaufmann, 2000.
Mielikainen J., Huang B.(2012); Lossless compression of hyperspectral images using clustered linear prediction with adaptive prediction length, IEEE Geoscience and Remote Sensing Letters, 9(6): 1118-1121. http://dx.doi.org/10.1109/LGRS.2012.2191531
Venugopal D., Mohan S., Raja S.(2016); An efficient block based lossless compression of medical images, Optik-International Journal for Light and Electron Optics, 127(2): 754-758.
Patauner C. et al. (2011); A lossless data compression system for a real-time application in HEP data acquisition, IEEE Transactions on Nuclear Science, 58(4): 1738-1744. http://dx.doi.org/10.1109/TNS.2011.2142193
Kolo J.G., et al. (2012); An adaptive lossless data compression scheme for wireless sensor networks, Journal of Sensors, vol. 2012, Article ID 539638, 1-20.
Kolo J.G. et al. (2015); Fast and efficient lossless adaptive compression scheme for wireless sensor networks, Computers & Electrical Engineering, 41: 275-287. http://dx.doi.org/10.1016/j.compeleceng.2014.06.008
A.L. Buchsbaum et al. (2000); Engineering the compression of massive tables: an experimental approach, Proceedings of the ACM-SIAM Annual Symposium on Discrete Algorithms, San Francisco, CA, 213-222.
A.L. Buchsbaum, G.S. Fowler, R. Giancarlo (2002); Improving table compression with combinatorial optimization, Proceedings of the ACM-SIAMAnnual Symposium on Discrete Algorithms, San Francisco, CA, 175-184.
Q. Yang, S. Lonardi (2005); A compression-boosting transform for 2D data, Data Compression Conference (DCC'05), 492-492.
H. Shan, A. Banerjee (2008); Bayesian co-clustering, Proceedings of the 8th IEEE International Conference on Data Mining (ICDM'08), 530-539.
F. Pan, X. Zhang, W. Wang (2008); CRD: fast co-clustering on large datasets utilizing sampling-based matrix decomposition, Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'08), 173-184.
A. Banerjee et al. (2007); A generalized maximum entropy approach to Bregman coclustering and matrix approximation, Journal of Machine Learning Research, 1919-1986.
I.S. Dhillon, S. Mallela, D.S. Modha (2003); Information-theoretic co-clustering, Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'03), 89-98.
R.G. Pensa, J.F. Boulicaut, F. Cordero, M. Atzori (2010); Co-clustering numerical data under user-defined constraints, Statistical Analysis and Data Mining,3(1): 38-55.
R.G. Pensa, J.F. Boulicaut (2008); Constrained co-clustering of gene expression data, Proceedings in Applied Mathematics 130, 8th SIAM International Conference on Data Mining 2008, 1:25-36.
Wu X., Zurita-Milla R., Kraak M.J. (2015); Co-clustering geo-referenced time series: exploring spatio-temporal patterns in Dutch temperature data, International Journal of Geographical Information Science, 29(4): 624-642. http://dx.doi.org/10.1080/13658816.2014.994520
http://en.wikipedia.org/wiki/7z
F. Coenen, The LUCS-KDD Discretised/normalized ARM and CARM Data Library, In http://www.csc.liv.ac.uk/~frans/KDD/Software/LUCS-KDD-DN/DataSets/dataSets.html
Published
Issue
Section
License
ONLINE OPEN ACCES: Acces to full text of each article and each issue are allowed for free in respect of Attribution-NonCommercial 4.0 International (CC BY-NC 4.0.
You are free to:
-Share: copy and redistribute the material in any medium or format;
-Adapt: remix, transform, and build upon the material.
The licensor cannot revoke these freedoms as long as you follow the license terms.
DISCLAIMER: The author(s) of each article appearing in International Journal of Computers Communications & Control is/are solely responsible for the content thereof; the publication of an article shall not constitute or be deemed to constitute any representation by the Editors or Agora University Press that the data presented therein are original, correct or sufficient to support the conclusions reached or that the experiment design or methodology is adequate.