P-ISSN 0974-6846 E-ISSN 0974-5645

Indian Journal of Science and Technology



Year: 2023, Volume: 16, Issue: 42, Pages: 3771-3777

Original Article

Google Hacking Database Attributes Enrichment and Conversion to Enable the Application of Machine Learning Techniques

Received Date: 18 July 2023, Accepted Date: 24 September 2023, Published Date: 13 November 2023


Objectives: To apply Natural Language Processing (NLP) to enrich the Google Hacking Database (GHDB) with attributes and to convert its textual values to ASCII, enabling the application of Machine Learning (ML) techniques to group Dorks by similarity and find vulnerabilities.

Methods: The computational experiments were conducted in seven steps: (1) Selection of the GHDB; (2) Removal of Hyperlinks and Deletion of Attributes; (3) Removal of the Site Parameter from Dorks; (4) Removal of Outliers and Stopwords; (5) Enrichment with NLP; (6) Base Transformation; and (7) Application of Self-Organizing Maps (SOM).

Findings: Applying NLP allowed the Dorks to be segmented into characters, which were then converted to their numeric ASCII values. This enriched the GHDB and enabled the application of ML techniques, in this case the SOM. The results obtained with the SOM were considered good: the topographic error (TE) and quantization error (QE) of the generated maps were close to 0, indicating good accuracy and that the maps represent the input data well.

Novelty: The formation of clusters of Dorks with SOM after enriching the GHDB with NLP.
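The site-parameter removal and ASCII conversion steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the regular expression, the function names `strip_site_parameter` and `dork_to_ascii`, the fixed vector length, and the zero padding are all assumptions chosen for illustration, and the sample Dorks are made up.

```python
import re

# Assumed pattern: a `site:` operator followed by its argument (not taken from the article).
SITE_PARAM = re.compile(r"\bsite:\S+\s*", re.IGNORECASE)


def strip_site_parameter(dork: str) -> str:
    """Remove any `site:` operator so only the generic part of the Dork remains."""
    return SITE_PARAM.sub("", dork).strip()


def dork_to_ascii(dork: str, length: int = 16, pad: int = 0) -> list[int]:
    """Split a Dork into characters and map each one to its ASCII code.

    Non-ASCII characters are dropped; the result is truncated or padded to a
    fixed length (an assumption) so every Dork yields a vector of equal size.
    """
    codes = [ord(ch) for ch in dork if ord(ch) < 128]
    codes = codes[:length]
    return codes + [pad] * (length - len(codes))


# Made-up example Dorks, for illustration only.
dorks = ["site:example.com inurl:admin", 'intitle:"index of" passwd']
vectors = [dork_to_ascii(strip_site_parameter(d)) for d in dorks]
```

Fixed-length numeric vectors of this kind are the sort of input a SOM implementation typically expects; how the article itself handles Dorks of different lengths is not stated in the abstract.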

Keywords: Google Hacking Database, Dorks, Natural Language Processing, Self-Organizing Maps, Enrichment




© 2023 Evangelista et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Published By Indian Society for Education and Environment (iSee)

