P-ISSN 0974-6846, E-ISSN 0974-5645

Indian Journal of Science and Technology

Year: 2021, Volume: 14, Issue: 19, Pages: 1580-1586

Original Article

A new approach to Web Crawling — DHEKTS Crawler in comparison with various Crawlers

Received Date: 08 April 2021, Accepted Date: 13 April 2021, Published Date: 01 June 2021

Abstract

Objectives: To propose a crawler that visits websites to collect information and build a search engine index for reference; to compare the license, implementation language, and effectiveness of various crawlers with those of the proposed DHEKTS Crawler; to compare the characteristics, tasks, and functions of those crawlers with the proposed DHEKTS Crawler; and to identify the merits of the DHEKTS Crawler. Methods: A new crawler called DHEKTS was developed to filter and synchronize documents such as images, links, and HTML code from a given website. The crawler is unique in that it returns all the details of a particular website, including its images, links, HTML code, and contents. It can crawl the links on a specified website and then follow further links found on that website. The DHEKTS Crawler is designed for Depth and Relevance crawling, and it provides several crawling mechanisms that support a variety of information. The requirements are: Operating System: Windows 7 or higher; Front End: PHP; Back End: MySQL; RAM: minimum 4 GB; Server: a high-speed server with good storage capacity. Findings: The DHEKTS Crawler retrieves web-related links, images, HTML code, and information up to the fifth level of crawling, and its Relevance Search returns relevant information. Multiple crawlers fulfill the major functions of crawling, but the DHEKTS Crawler is built to execute all of these functions in a single crawler. Applications: The crawler can be applied to crawl various websites and retrieve valuable data.
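
The depth crawling described in the Methods can be illustrated with a minimal sketch. It is not the DHEKTS implementation, which the abstract states is built with PHP and MySQL; it is a Python illustration under stated assumptions. The names `LinkImageParser`, `crawl`, and `fetch` are hypothetical. The breadth-first order and the restriction to links on the same host are also assumptions. The default of five levels follows the Findings.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkImageParser(HTMLParser):
    """Collects href values of <a> tags and src values of <img> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "img" and attrs.get("src"):
            self.images.append(attrs["src"])


def crawl(start_url, fetch, max_depth=5):
    """Breadth-first crawl of one site, following at most max_depth link hops.

    fetch is a hypothetical callable that maps a URL to its HTML text,
    or returns None when the page cannot be retrieved.
    """
    site = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    pages = {}
    while queue:
        url, depth = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        parser = LinkImageParser()
        parser.feed(html)
        # Resolve relative references against the page they were found on.
        links = [urljoin(url, href) for href in parser.links]
        images = [urljoin(url, src) for src in parser.images]
        pages[url] = {"depth": depth, "html": html, "links": links, "images": images}
        if depth < max_depth:
            for link in links:
                # Assumption: stay on the starting host and visit each URL once.
                if urlparse(link).netloc == site and link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return pages
```

Passing `fetch` as a parameter keeps the sketch independent of any HTTP library: a caller could supply a function built on `urllib.request`, or a dictionary's `get` method for offline experiments. Each entry of the returned dictionary holds the page's HTML together with the links and images found on it, mirroring the three kinds of output the abstract lists.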

Keywords: Crawler; DHEKTS Crawler; License; tasks; functions; effectiveness; Comparison


Copyright

© 2021 ThirugnanaSambanthan. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Published by Indian Society for Education and Environment (iSee).
