A Hybrid Approach for Extracting Web Information

Affiliations

  • School of Computing, SASTRA University, Thanjavur - 613401, Tamil Nadu, India
  • SASTRA University, Thanjavur - 613401, Tamil Nadu, India

Abstract


Web page mining is the predominant technique for collecting data from the internet. It extracts information from web pages in either a supervised or an unsupervised manner. Unsupervised extraction tends to return more irrelevant data than relevant data, and it fails to eliminate redundant data. The proposed hybrid approach separates the relevant content from web pages and filters out duplicated content. The hybrid algorithm performs region separation using a tag tree and then strains out repeated information, so that the output contains only reliable data. The approach is an effective way to extract relevant information from web pages.
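
The two steps named in the abstract, region separation on a tag tree followed by removal of repeated information, can be illustrated with a minimal sketch. This is not the authors' algorithm, whose details are not given here. It assumes that the data region is the element with the largest number of same-tag children, and it removes duplicates by exact, case-insensitive text matching rather than by multi string alignment. All names (`Node`, `TagTreeBuilder`, `find_data_region`, `extract_records`, `SAMPLE_HTML`) are hypothetical and used only for illustration.

```python
from html.parser import HTMLParser


class Node:
    """One element of the tag tree."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.items = []  # child Nodes and text strings, in document order

    @property
    def children(self):
        return [i for i in self.items if isinstance(i, Node)]


class TagTreeBuilder(HTMLParser):
    """Builds a simple tag tree from HTML using the standard library parser."""

    VOID = {"br", "img", "hr", "input", "meta", "link"}

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.items.append(node)
        if tag not in self.VOID:  # void elements never get children
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag, then close it.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.items.append(data.strip())


def node_text(node):
    """Concatenate all text under a node in document order."""
    parts = [i if isinstance(i, str) else node_text(i) for i in node.items]
    return " ".join(p for p in parts if p)


def find_data_region(root, min_records=2):
    """Assumed heuristic: the region is the node with most same-tag children."""
    best, best_count = None, 0
    stack = [root]
    while stack:
        node = stack.pop()
        counts = {}
        for child in node.children:
            counts[child.tag] = counts.get(child.tag, 0) + 1
        top = max(counts.values(), default=0)
        if top >= min_records and top > best_count:
            best, best_count = node, top
        stack.extend(node.children)
    return best


def extract_records(html):
    """Separate the data region, then drop records whose text repeats."""
    builder = TagTreeBuilder()
    builder.feed(html)
    region = find_data_region(builder.root)
    if region is None:
        return []
    seen, records = set(), []
    for child in region.children:
        text = " ".join(node_text(child).split())
        if text and text.lower() not in seen:
            seen.add(text.lower())
            records.append(text)
    return records


SAMPLE_HTML = """
<html><body>
<div id="nav"><a href="#">Home</a></div>
<ul>
<li><b>Camera A</b> $100</li>
<li><b>Camera B</b> $150</li>
<li><b>Camera A</b> $100</li>
</ul>
<p>Footer</p>
</body></html>
"""

if __name__ == "__main__":
    print(extract_records(SAMPLE_HTML))
```

In this sketch the navigation link and footer fall outside the selected region, and the repeated list item is kept only once. A real extractor would need a more robust region heuristic and a tolerant similarity measure for near-duplicate records.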

Keywords

Hybrid Approach for Extraction, Multi String Alignment, Region Separation, Web Content Mining

Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 License.