Information Technologies

Web Wrapper Agent

A Web wrapper is a program that wraps one or more Web sites so that other applications can process the data contained in those sites for information integration. Web information integration differs from database information integration because of the nature of the Web: data reside in interlinked Web pages rather than in tables or objects with clearly defined schemas, as in database systems. Building wrappers for relational databases is relatively easy because the databases are ready to be accessed by other programs. Web wrappers, however, must automate Web browsing sessions to extract data from the target Web pages, and each Web site has its own page linkages, layout templates, and syntax. A brute-force solution is to program a wrapper for each particular browsing session. That solution, however, may lead to wrappers that are sensitive to Web site changes and are therefore difficult to scale up and maintain.

We provide a solution for rapidly generating intelligent agents that serve as Web wrappers for Web information integration. Our solution emphasizes the reconfigurability of Web wrappers so that they can be rapidly developed and easily maintained without skilled programmers.

First, we define an XML-based script language called WNDL (Web Navigation Description Language). Scripts written in WNDL are interpreted and executed by a WNDL executor, which offers the following features:

An early prototype of our system was equipped with a wrapper induction system called Softmealy to generate data extractors. More recently, we have developed another algorithm called IEPAD (an acronym for information extraction based on pattern discovery). Unlike work in wrapper induction, IEPAD applies sequential pattern mining techniques to discover data extraction patterns from a document. This removes the need to label training examples manually and thus minimizes human intervention. Some heuristic-based systems on the market claim to extract data from the Web automatically. However, they are limited to a very narrow class of Web pages that match their heuristics. In contrast, IEPAD does not depend on heuristics.

A complete Web wrapper agent includes a WNDL script as well as IEPAD data extractors. We have also developed a programming-by-example authoring tool that allows users to generate Web wrapper agents by browsing the target Web sites for their particular information-gathering task. A generated Web wrapper agent can be reconfigured through the same authoring tool, which maximizes the maintainability and scalability of a Web information integration system.
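
The idea of discovering extraction patterns without labeled examples can be illustrated with a small Python sketch. It is not IEPAD's actual algorithm, whose details are not given here. As a stand-in, it abstracts a page into a sequence of tag tokens and searches by brute force for the contiguous token pattern that repeats most extensively, then reads one record out of each repetition. The function names (tokenize, match_starts, discover_pattern, extract_records), the scoring rule, and the sample page are all assumptions made for illustration.

```python
import re


def tokenize(html):
    """Abstract a page into tag tokens; each text run becomes the token 'TEXT'.

    Naive on purpose: attributes are ignored, and comments or scripts are
    not handled. Returns the tokens and a parallel list of text payloads.
    """
    tokens, texts = [], []
    for tag, text in re.findall(r"<\s*(/?\w+)[^>]*>|([^<]+)", html):
        if tag:
            tokens.append("<" + tag.lower() + ">")
            texts.append(None)
        elif text.strip():
            tokens.append("TEXT")
            texts.append(text.strip())
    return tokens, texts


def match_starts(tokens, pattern):
    """Start indices of left-to-right, non-overlapping matches of pattern."""
    n, i, starts = len(pattern), 0, []
    while i + n <= len(tokens):
        if tuple(tokens[i:i + n]) == pattern:
            starts.append(i)
            i += n
        else:
            i += 1
    return starts


def discover_pattern(tokens, min_len=3, min_count=2):
    """Pick the repeated pattern that covers the most tokens (assumed scoring).

    Brute force over all n-grams; fine for a short page, too slow for real use.
    """
    best, best_score = None, (0, 0)
    for n in range(min_len, len(tokens) // min_count + 1):
        grams = sorted({tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)})
        for gram in grams:
            count = len(match_starts(tokens, gram))
            score = (n * count, n)  # coverage first, then pattern length
            if count >= min_count and score > best_score:
                best, best_score = gram, score
    return best


def extract_records(html):
    """Return the text fields of every occurrence of the discovered pattern."""
    tokens, texts = tokenize(html)
    pattern = discover_pattern(tokens)
    if pattern is None:
        return []
    size = len(pattern)
    return [
        [t for t in texts[s:s + size] if t is not None]
        for s in match_starts(tokens, pattern)
    ]


# Hypothetical sample page: three records sharing one layout template.
page = (
    "<ul><li><b>Alpha</b> 10</li>"
    "<li><b>Beta</b> 20</li>"
    "<li><b>Gamma</b> 30</li></ul>"
)
print(extract_records(page))
# prints [['Alpha', '10'], ['Beta', '20'], ['Gamma', '30']]
```

No record boundaries or field labels are supplied by the user in this sketch: the repeated structure of the page itself yields the extraction pattern, which is the property the text attributes to pattern discovery as opposed to wrapper induction from labeled examples.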