• P-ISSN 0974-6846 E-ISSN 0974-5645

Indian Journal of Science and Technology



Year: 2015, Volume: 8, Issue: 27, Pages: 1-9

Original Article

Hadoop-based Crawling and Detection of New HTML5 Vulnerabilities on Public Institutions’ Web Sites


HTML5 is a recent version of HTML, the markup language for web documents. It was developed to address the shortcomings of earlier HTML versions. However, the new elements and functions of HTML5 have widened the range of attacks that third parties can mount. Public institutions that adopt HTML5 are particularly exposed, and their web sites are more vulnerable to these attacks than private web sites. Public institutions’ web sites also contain more web documents than general web sites, because they provide information on policies, voting, and other events, and are linked to subordinate institutions. Because of this large number of web documents, we used Hadoop, an open-source framework for distributed storage and processing. HTML5 vulnerability detection was carried out over the documents using distributed parallel processing. Applying distributed parallel processing to both crawling and detection improved the performance of both steps for the large number of web documents connected to public institutions’ web sites.
Keywords: Crawling, Distributed Parallel Processing, Hadoop, HTML5 Vulnerability
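
The detection step summarized in the abstract can be illustrated with a minimal map/reduce-style sketch. This is not the authors' implementation: the abstract does not list the detection rules, so the rule names and regular expressions below are hypothetical examples of HTML5 features commonly associated with script injection, and the job is run locally in plain Python rather than on a Hadoop cluster. In a real deployment, a framework such as Hadoop would distribute the map calls across crawled documents.

```python
import re
from collections import defaultdict

# Hypothetical rule set (an assumption, not taken from the paper).
# Each pattern is a simple, order-sensitive heuristic, not a full HTML parser.
RULES = {
    # An element that grabs focus on load and runs script when focused.
    "autofocus-onfocus": re.compile(r"<[^>]*\bautofocus\b[^>]*\bonfocus\s*=", re.I),
    # A form control that overrides the form target with a javascript: URL.
    "formaction-javascript": re.compile(r"\bformaction\s*=\s*[\"']?\s*javascript:", re.I),
    # HTML5 media elements carrying an inline error handler.
    "media-onerror": re.compile(r"<(?:video|audio|source)\b[^>]*\bonerror\s*=", re.I),
    # Cross-document messaging sent to any origin.
    "postmessage-wildcard": re.compile(r"\.postMessage\s*\([^)]*,\s*[\"']\*[\"']\s*\)", re.I),
}


def map_document(url, html):
    """Map step: emit (rule_name, url) for every rule that matches the document."""
    for name, pattern in RULES.items():
        if pattern.search(html):
            yield name, url


def reduce_findings(pairs):
    """Reduce step: group the matching URLs under each rule name."""
    grouped = defaultdict(list)
    for name, url in pairs:
        grouped[name].append(url)
    return {name: sorted(urls) for name, urls in grouped.items()}


def run_local(documents):
    """Run map then reduce in one process; stands in for a distributed job."""
    pairs = []
    for url, html in documents.items():
        pairs.extend(map_document(url, html))
    return reduce_findings(pairs)


if __name__ == "__main__":
    # Hypothetical crawled pages keyed by URL.
    pages = {
        "http://example.org/a": '<input autofocus onfocus="alert(1)">',
        "http://example.org/b": "<p>hello</p>",
    }
    print(run_local(pages))
```

Because each document is scanned independently in the map step, the work can be partitioned across nodes without coordination, which is the property that distributed parallel processing relies on.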

