P-ISSN 0974-6846, E-ISSN 0974-5645

Indian Journal of Science and Technology



Year: 2022, Volume: 15, Issue: 41, Pages: 2188-2193

Original Article

Structuring of Unstructured Data from Heterogeneous Sources

Received Date: 30 July 2022, Accepted Date: 28 September 2022, Published Date: 10 November 2022


Objectives: To develop a new data-gathering process from a Big Data perspective, and to convert unstructured text data into a structured format without losing any of the available text. Methods: The unstructured data are preprocessed using modified stemming and tokenization. From the stemming output, the proposed Term Frequency-Inverse Document Frequency (TF-IDF) and N-gram features are derived. The unstructured data are drawn from multiple sources: Twitter, consumer complaints, and news blogs. Findings: The proposed model with conventional TF-IDF features showed a relatively high Mean Absolute Error (MAE) of 1.4325, compared with 0.5197 for the proposed model without optimization. Novelty: The novelty of this work lies in the stemming process, to which a dictionary-checking step is added, and in the improved feature extraction, in which an interclass dispersion coefficient is computed for the TF-IDF features.
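
As an illustration of the preprocessing and feature steps named in the Methods, the following is a minimal Python sketch, not the authors' implementation. It assumes a toy dictionary and suffix list (`DICTIONARY` and `SUFFIXES`, both hypothetical), a simple suffix-stripping stemmer that keeps a stem only when it is a dictionary word, and the standard TF-IDF weighting. The interclass dispersion coefficient of the proposed TF-IDF variant is not defined in this abstract and is therefore not reproduced.

```python
import math
import re
from collections import Counter

# Hypothetical toy resources for illustration only; the paper's actual
# dictionary and stemming rules are not given in the abstract.
DICTIONARY = {"complain", "deliver", "late", "order", "refund", "delay"}
SUFFIXES = ("ing", "ed", "s")


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def stem(token):
    """Strip a suffix only if the resulting stem is a dictionary word."""
    for suffix in SUFFIXES:
        if token.endswith(suffix):
            candidate = token[: -len(suffix)]
            if candidate in DICTIONARY:
                return candidate
    return token  # no dictionary-approved stem found, keep the token as-is


def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tf_idf(docs):
    """Standard TF-IDF: tf = count / document length, idf = log(N / df)."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append(
            {t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return weights


raw = ["Order delayed, refund delayed", "Complaining about late orders"]
docs = [[stem(t) for t in tokenize(text)] for text in raw]
bigrams = ngrams(docs[0], 2)
weights = tf_idf(docs)
```

In this sketch a term that occurs in every document (here `order`) receives a weight of zero, which is the usual behaviour of unsmoothed TF-IDF; a class-aware coefficient such as the one the abstract mentions would modify these weights using label information.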

Keywords: Natural language processing; Structured data; Unstructured data; Big data; Feature extraction




© 2022 Shilpa & Shambhavi. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Published By Indian Society for Education and Environment (iSee)

