P-ISSN 0974-6846, E-ISSN 0974-5645

Indian Journal of Science and Technology



Year: 2021, Volume: 14, Issue: 20, Pages: 1635-1641

Original Article

Identifying Similar Question Pairs Using Machine Learning Techniques

Received Date: 19 February 2021, Accepted Date: 17 May 2021, Published Date: 04 June 2021


Background/Objectives: Millions of people visit question-and-answer sites such as Quora, Reddit, and Stack Overflow every day, and the demand for intelligent techniques that help them find better answers is growing. Methods: In the proposed system, the Quora dataset was filtered using SQLite, which took one quarter of the time needed to pre-process the same dataset with existing approaches such as Python functions. Four machine learning techniques, namely Random Forest, Logistic Regression, Linear SVM (Support Vector Machine), and XGBoost, were then compared to identify the most suitable model. Findings: The log loss values of Random Forest, Logistic Regression, Linear SVM, and XGBoost were 0.887, 0.521, 0.654, and 0.357, respectively. XGBoost had the lowest log loss and was therefore the best-performing of the four models. Conclusion/Future Scope: XGBoost outperformed the other machine learning techniques considered in the study, and pre-processing with SQLite reduced the response time. In future work, a similar approach could be applied to other platforms such as Reddit and Stack Overflow, and the best models of each type (linear, tree-based, and neural network) could be combined in an ensemble.
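The two technical ingredients named in the abstract, SQLite-based filtering and log loss as the comparison metric, can be illustrated with a minimal sketch. The abstract does not give the table schema, the filtering rules, or the evaluation code, so everything below is an assumption for illustration: the column names `question1`, `question2`, and `is_duplicate`, the helper names `filter_pairs` and `log_loss`, and the rule of dropping rows with empty questions are hypothetical and not taken from the paper.

```python
import math
import sqlite3


def filter_pairs(rows):
    """Filter question pairs with SQLite (hypothetical schema and rules).

    rows: iterable of (question1, question2, is_duplicate) tuples.
    Returns the rows where both questions are non-empty, trimmed and lowercased.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE pairs (question1 TEXT, question2 TEXT, is_duplicate INTEGER)"
    )
    conn.executemany("INSERT INTO pairs VALUES (?, ?, ?)", rows)
    cleaned = conn.execute(
        """
        SELECT LOWER(TRIM(question1)), LOWER(TRIM(question2)), is_duplicate
        FROM pairs
        WHERE question1 IS NOT NULL AND TRIM(question1) <> ''
          AND question2 IS NOT NULL AND TRIM(question2) <> ''
        ORDER BY rowid
        """
    ).fetchall()
    conn.close()
    return cleaned


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary log loss; lower is better.

    y_true: 0/1 labels. y_prob: predicted probability of label 1.
    Probabilities are clipped to [eps, 1 - eps] to avoid log(0).
    """
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


if __name__ == "__main__":
    sample = [
        ("How do I learn Python? ", "How can I learn Python?", 1),
        ("", "What is XGBoost?", 0),  # dropped: empty first question
    ]
    print(filter_pairs(sample))
    # Two hypothetical models scored on the same labels.
    labels = [1, 0, 1]
    print(log_loss(labels, [0.9, 0.2, 0.8]))
    print(log_loss(labels, [0.6, 0.5, 0.5]))
```

Under this metric a model that always predicts 0.5 scores ln 2 (about 0.693), which gives a rough baseline for reading the reported values: a model is only informative if its log loss falls below that figure.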


Keywords: Machine Learning, Question Pair Similarity, XGBoost, Linear SVM, Logistic Regression, Random Forest




© 2021 Anishaa et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Published by Indian Society for Education and Environment (iSee)

