Structuring of Unstructured Data from Heterogeneous Sources

B L Shilpa; B R Shambhavi

doi:10.17485/IJST/v15i41.1566

Indian Journal of Science and Technology

Year: 2022, Volume: 15, Issue: 41, Pages: 2188-2193

Original Article

B L Shilpa¹*, B R Shambhavi²

¹Assistant Professor, Department of ISE, GSSSIETW, Mysuru, Karnataka, India
²Associate Professor, Department of ISE, BMSCE, Bengaluru, Karnataka, India

*Corresponding Author
Email: [email protected]

Received Date: 30 July 2022, Accepted Date: 28 September 2022, Published Date: 10 November 2022

This work is licensed under a Creative Commons Attribution 4.0 International License.

Abstract

Objectives: To develop a new data-gathering process from a Big Data perspective, and to convert unstructured text data into a structured format without losing any of the available text. Methods: The unstructured data is preprocessed using modified stemming and tokenization. From the stemming output, the proposed Term Frequency-Inverse Document Frequency (TF-IDF) features and N-gram features are derived. The unstructured data is drawn from multiple sources: Twitter, consumer complaints and news blogs. Findings: The proposed model with extant TF-IDF features showed a relatively high Mean Average Error (MAE) of 1.4325, compared with 0.5197 for the proposed model without optimization. Novelty: The novelty of the work lies in the stemming process, to which a dictionary check is added, and in the improved feature extraction, in which an interclass dispersion coefficient is computed for the TF-IDF features.

Keywords: Natural language processing; Structured data; Unstructured data; Big data; Feature extraction
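
The pipeline summarized in the abstract (tokenization, stemming with a dictionary check, then N-gram and TF-IDF features) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify the modified stemming rules, the dictionary, or the interclass dispersion coefficient, so the sketch uses plain suffix stripping, a hypothetical mini-dictionary (`DICTIONARY`), and the standard TF-IDF weighting. All function names are illustrative assumptions.

```python
import math
import re
from collections import Counter

# Hypothetical mini-dictionary; the paper's actual dictionary is not given.
DICTIONARY = {"complain", "deliver", "order", "late", "refund", "delay"}
# Assumed suffix list for a naive stemmer (not the paper's modified stemmer).
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def stem_with_dictionary(token, dictionary=DICTIONARY):
    """Strip one suffix, accepting the stem only if it is a dictionary word."""
    for suffix in SUFFIXES:
        if token.endswith(suffix):
            candidate = token[: -len(suffix)]
            if candidate in dictionary:
                return candidate
    # No valid stem found: keep the original token so no text is dropped.
    return token


def ngrams(tokens, n):
    """Return the contiguous word n-grams of a token list as strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf(docs):
    """Standard TF-IDF per document; docs is a list of token lists."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    rows = []
    for doc in docs:
        counts = Counter(doc)
        rows.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return rows


if __name__ == "__main__":
    texts = ["Orders delayed again", "Refunded the late order"]
    stemmed = [[stem_with_dictionary(t) for t in tokenize(x)] for x in texts]
    # One structured record per document: unigram weights plus bigrams.
    for tokens, weights in zip(stemmed, tfidf(stemmed)):
        print({"tokens": tokens, "bigrams": ngrams(tokens, 2), "tfidf": weights})
```

The dictionary check is what keeps the stemmer from producing non-words: a token such as "running" stays unchanged here because "runn" is not in the dictionary, whereas "delayed" reduces to "delay".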

Copyright

© 2022 Shilpa & Shambhavi. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Published by Indian Society for Education and Environment (iSee).