Exploration of Sindhi Corpus Through Statistical Analysis on the Basis of Reality

Irum Naz Sodhar; Suriani Sulaiman; Abdul Hafeez Buller

doi:10.17485/IJST/v16i12.236

Indian Journal of Science and Technology

Year: 2023, Volume: 16, Issue: 12, Pages: 924-930

Original Article

Irum Naz Sodhar¹*, Suriani Sulaiman², Abdul Hafeez Buller³

¹Post-Doctoral Fellow, Department of Computer Science, Kulliyyah (Faculty) of Information and Communication Technology, International Islamic University, Malaysia
²Assistant Professor, Department of Computer Science, Kulliyyah (Faculty) of Information and Communication Technology, International Islamic University, Malaysia
³Post-Doctoral Fellow, Department of Civil Engineering, Kulliyyah (Faculty) of Engineering, International Islamic University, Malaysia

*Corresponding Author
Email: [email protected]

Received Date: 01 February 2023, Accepted Date: 23 March 2023, Published Date: 28 March 2023

This work is licensed under a Creative Commons Attribution 4.0 International License.

Abstract

Objectives: The Sindhi language is given more importance in Sindh’s educational institutions than other regional languages, and the majority of the population uses it in today’s mobile applications, letters, text messages and other text conversations. Because communication over computer systems and mobile phones is growing significantly, research is needed to analyze the Sindhi corpus. This study focuses on the Sindhi alphabet and performs different tasks on the corpus.

Methods: Data were collected from available resources, and a corpus was created in Sindhi and English. Twenty letter patterns, three dot alignments within the letters, and six symbols for forming letters are used. After collection, the data were explored and analyzed through different tasks.

Findings: A corpus of Sindhi text is being built because of its importance for language, linguistics and other developments in natural language processing (NLP). This study statistically analyzes the Sindhi-English corpus on the basis of reality and finds two shortest words (۽ and ۾) and three longest words (پاڪستان, انگلینڊ and ڳالھیون). The letter 'آ' is used as a single letter in the Sindhi alphabet; the least frequently occurring letter is a consonant, and the most frequently occurring is the vowel ئ.

Novelty: Text analysis is an important area in data mining and other research, and this study statistically analyzes the Sindhi-English corpus on the basis of reality. The authors explore the orthography and composition of Sindhi corpora and recommend that Romanized language data be used for Sindhi as well. Preprocessing is not easy because of the lack of resources, and the character conversion model has generated two languages.

Keywords: Sindhi; Language exploration; Corpus; Statistical Analysis; pattern of letters; Text conversation
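
The kind of statistics reported in the Findings (shortest words, longest words, and letter frequencies) can be illustrated with a minimal Python sketch. This is not the authors' procedure: the function names `word_length_extremes` and `letter_frequencies` are hypothetical, the token list is a toy sample built only from the words quoted in the abstract, and length is measured in Unicode code points, which is an assumption about how word size was counted.

```python
from collections import Counter


def word_length_extremes(tokens):
    """Return (shortest, longest) lists of distinct tokens.

    Length is the number of Unicode code points, so a base letter and
    any separate combining mark would be counted as two characters.
    """
    distinct = sorted(set(tokens))
    if not distinct:
        return [], []
    lengths = [len(token) for token in distinct]
    lo, hi = min(lengths), max(lengths)
    shortest = [t for t in distinct if len(t) == lo]
    longest = [t for t in distinct if len(t) == hi]
    return shortest, longest


def letter_frequencies(tokens):
    """Count how often each character occurs across all tokens."""
    counts = Counter()
    for token in tokens:
        counts.update(token)
    return counts


if __name__ == "__main__":
    # Toy sample: only the words quoted in the abstract, not the real corpus.
    sample = ["پاڪستان", "۽", "انگلینڊ", "۾", "ڳالھیون"]

    shortest, longest = word_length_extremes(sample)
    print("shortest:", shortest)
    print("longest:", longest)

    freq = letter_frequencies(sample)
    ranked = freq.most_common()  # sorted from most to least frequent
    print("most frequent:", ranked[0])
    print("least frequent:", ranked[-1])
```

On a real corpus the token list would come from a tokenizer suited to Sindhi script, and the frequency table would typically be restricted to the letters of the Sindhi alphabet before ranking.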

Copyright

© 2023 Sodhar et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Published by the Indian Society for Education and Environment (iSee).