A Feature Engineering and Ensemble Learning Based Approach for Repeated Buyers Prediction
DOI:
https://doi.org/10.15837/ijccc.2022.6.4988
Keywords:
feature engineering; ensemble learning; fusion model; repeat buyer prediction
Abstract
The global e-commerce market is growing rapidly, but the percentage of repeat buyers is low. According to Tmall, the repurchase rate is only 6.1%, while research shows that a 5% increase in the repurchase rate can lead to a 25% to 95% increase in profit. To raise the repurchase rate, merchants need to identify potential repeat buyers and convert them into repurchasers. In this paper we build a prediction model for repeat purchasers using Tmall's dataset. First, we construct high-quality features for e-commerce scenarios through manual construction and algorithmic selection. We apply the synthetic minority oversampling technique (SMOTE) to address the data imbalance problem and improve prediction performance. Then we train classical classifiers, including factorization machine and logistic regression, and ensemble learning classifiers, including extreme gradient boosting and light gradient boosting machine. Finally, we construct a two-layer fusion model based on the Stacking algorithm to further enhance prediction performance. The results show that, through data imbalance processing, feature engineering, and model fusion, the model's area under the curve (AUC) value improves by 0.01161. Our findings provide important implications for managing e-commerce platforms and their merchants.
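The oversampling step mentioned in the abstract can be illustrated with a minimal sketch of the core SMOTE idea: each synthetic sample is placed at a random position on the line segment between a minority-class sample and one of its k nearest minority-class neighbours. This is a simplified illustration under stated assumptions, not the authors' implementation; the function name `smote`, its parameters, and the toy data are hypothetical and use only the Python standard library.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Return n_new synthetic points interpolated between minority samples.

    Hypothetical helper for illustration; `minority` is a list of
    equal-length numeric tuples belonging to the minority class.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than points
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(p, base),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


# Toy data (assumed for illustration): 4 minority vs. 20 majority samples.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
majority = [(float(i % 5), float(i // 5) + 3.0) for i in range(20)]

# Generate enough synthetic minority points to match the majority count.
new_points = smote(minority, n_new=len(majority) - len(minority), k=3)
balanced_minority = minority + new_points
```

Because every synthetic point is an interpolation between two real minority samples, the new points stay inside the region already occupied by the minority class. In a real pipeline the oversampling would normally be applied to the training split only, so that the evaluation data keeps its original class distribution.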
License
Copyright (c) 2022 Mingyang Zhang, Jiayue Lu, Ning Ma, T.C. Edwin Cheng, Guowei Hua
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
ONLINE OPEN ACCESS: Access to the full text of each article and each issue is free under the Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license.
You are free to:
-Share: copy and redistribute the material in any medium or format;
-Adapt: remix, transform, and build upon the material.
The licensor cannot revoke these freedoms as long as you follow the license terms.
DISCLAIMER: The author(s) of each article appearing in International Journal of Computers Communications & Control is/are solely responsible for the content thereof; the publication of an article shall not constitute or be deemed to constitute any representation by the Editors or Agora University Press that the data presented therein are original, correct or sufficient to support the conclusions reached or that the experiment design or methodology is adequate.