A Latent Dirichlet Allocation and Fuzzy Clustering Based Machine Learning Model for Text Thesaurus
Keywords:
text, LDA, fuzzy clustering, thesaurus, Word2vec, machine learning
Abstract
Manual methods cannot feasibly process huge amounts of structured and semi-structured data. This study aims to solve the problem of processing such data with machine learning algorithms. We collected public-opinion text data about a company through web crawlers, used the Latent Dirichlet Allocation (LDA) algorithm to extract keywords from the text, and used fuzzy clustering to group the keywords into topics. The topic keywords serve as a seed dictionary for new word discovery. To verify the efficiency of machine learning in new word discovery, algorithms based on association rules, N-Gram, PMI, and Word2vec were compared on the new word discovery task. The experimental results show that the Word2vec algorithm, which is based on a machine learning model, achieves the highest accuracy, recall, and F-value.
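As an illustration of one of the compared baselines, the following is a minimal sketch of PMI-based new word discovery together with the kind of recall and F-value scoring the abstract mentions. It is not the authors' implementation. The function names `pmi_candidates` and `precision_recall_f1`, the `min_count` threshold, the toy token list, and the gold word set are all hypothetical, and precision is used here as a stand-in for what the abstract calls accuracy, which is an assumption.

```python
import math
from collections import Counter


def pmi_candidates(tokens, min_count=2):
    """Score adjacent token pairs by pointwise mutual information (PMI).

    Pairs seen fewer than `min_count` times are skipped (hypothetical
    threshold), because PMI is unreliable for very rare pairs.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (a, b), count in bigrams.items():
        if count < min_count:
            continue
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        # PMI(a, b) = log2( P(a, b) / (P(a) * P(b)) )
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return scores


def precision_recall_f1(predicted, gold):
    """Compare predicted new words against a gold set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy corpus and gold set, invented purely for illustration.
tokens = ("machine learning model uses machine learning "
          "for text mining and text mining helps").split()
scores = pmi_candidates(tokens)

# Treat every pair with positive PMI as a candidate new word.
predicted = [pair for pair, score in scores.items() if score > 0]
gold = [("machine", "learning"), ("fuzzy", "clustering")]
metrics = precision_recall_f1(predicted, gold)
```

In a setting like the one described, the candidate pairs would presumably be filtered against the seed dictionary built from the topic keywords before scoring, but that step is left out here because the abstract does not specify how it is done.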
License
ONLINE OPEN ACCESS: Access to the full text of each article and each issue is free under the Attribution-NonCommercial 4.0 International license (CC BY-NC 4.0).
You are free to:
-Share: copy and redistribute the material in any medium or format;
-Adapt: remix, transform, and build upon the material.
The licensor cannot revoke these freedoms as long as you follow the license terms.
DISCLAIMER: The author(s) of each article appearing in International Journal of Computers Communications & Control is/are solely responsible for the content thereof; the publication of an article shall not constitute or be deemed to constitute any representation by the Editors or Agora University Press that the data presented therein are original, correct or sufficient to support the conclusions reached or that the experiment design or methodology is adequate.