P-ISSN 0974-6846 E-ISSN 0974-5645

Indian Journal of Science and Technology

Year: 2023, Volume: 16, Issue: 13, Pages: 1021-1029

Original Article

Building Kashmiri Sense Annotated Corpus and its Usage in Supervised Word Sense Disambiguation

Received Date: 15 December 2022, Accepted Date: 04 March 2023, Published Date: 06 April 2023

Abstract

Objectives: This research work makes a maiden attempt at developing a sense-annotated corpus for Kashmiri Lexical Sample Word Sense Disambiguation (WSD). A sense-annotated dataset is required to use supervised WSD techniques, which are the most effective techniques for carrying out WSD. Because developing a sense-tagged dataset is an arduous task, such datasets are not available for all natural languages. Kashmiri, being a computationally low-resource language, does not have a sense-tagged corpus available for research purposes.

Methods: To develop the sense-annotated dataset, we selected 60 commonly used ambiguous Kashmiri words and annotated the dataset manually. The usefulness of the dataset is also examined by applying machine learning algorithms (k-NN, Decision Tree (DT) and Support Vector Machine (SVM)) to it. Part of Speech (PoS) and Bag of Words (BoW) features are used to train the classifiers.

Findings: The performance of the machine learning algorithms for Kashmiri WSD is evaluated using the accuracy metric. Of the classifiers used, SVM showed the best performance, with an average accuracy of 75.74%.

Novelty: This research is the first attempt to develop a sense-tagged dataset for the Kashmiri language. The developed dataset would be of great importance to the research community and can be used in various Natural Language Processing tasks such as WSD and part-of-speech tagging.

Keywords: Sense Annotation; Machine Learning; Word Sense Disambiguation; WordNet; Part-of-Speech Tagging
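As a rough illustration of the supervised lexical-sample setup described in the abstract, the sketch below trains a k-NN sense classifier on Bag of Words features and scores it with the accuracy metric. It is a minimal sketch, not the authors' implementation: the English example contexts, the sense labels, the stop-word list, the choice of cosine similarity, and all function names are assumptions made for illustration. PoS features, DT, and SVM are omitted for brevity.

```python
import math
from collections import Counter

# Toy, invented English contexts for the ambiguous word "bank".
# They are NOT drawn from the Kashmiri corpus described in the abstract.
TARGET = "bank"
STOP_WORDS = {"the", "a", "an", "of", "on", "in", "at", "and",
              "to", "he", "she", "they", "my"}  # hypothetical list

TRAIN = [
    ("he deposited money in the bank account", "finance"),
    ("the bank approved the loan and the money arrived", "finance"),
    ("she opened a savings account at the bank", "finance"),
    ("they sat on the river bank and watched the water", "river"),
    ("the boat drifted toward the muddy bank of the river", "river"),
    ("fish swam near the grassy bank of the stream", "river"),
]
TEST = [
    ("the bank charged a fee on my account", "finance"),
    ("reeds grew along the bank of the river", "river"),
]


def bow(context):
    """Bag of Words: counts of context tokens, minus stop words and the target."""
    tokens = context.lower().split()
    return Counter(t for t in tokens if t != TARGET and t not in STOP_WORDS)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_predict(context, train, k=3):
    """Return the majority sense among the k most similar training contexts."""
    query = bow(context)
    ranked = sorted(train, key=lambda ex: cosine(query, bow(ex[0])), reverse=True)
    votes = Counter(sense for _, sense in ranked[:k])
    return votes.most_common(1)[0][0]


def accuracy(test, train, k=3):
    """Fraction of test contexts whose predicted sense matches the gold sense."""
    correct = sum(knn_predict(ctx, train, k) == gold for ctx, gold in test)
    return correct / len(test)


if __name__ == "__main__":
    for ctx, gold in TEST:
        print(ctx, "->", knn_predict(ctx, TRAIN), "(gold:", gold + ")")
    print("accuracy:", accuracy(TEST, TRAIN))
```

In a lexical-sample setting, one such classifier would typically be trained per ambiguous target word, and the per-word accuracies averaged to obtain a single figure.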

Copyright

© 2023 Mir et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Published By Indian Society for Education and Environment (iSee
