Adlet: Active Document for Adaptive Information Integration (part1)

ABSTRACT: We describe a new approach to resource discovery based on the concept of active documents advertising on the Internet, whereby a document dynamically builds metadata, joins a group of documents of the same interest and advertises the collected metadata to the members of the group. This abstraction of metadata is called an adlet, which is the core of our approach. This approach supports the visual specification and visualization of adlets, the creation and composition of adlets, the negotiation protocols and communications with the active network. Two important features make this approach applicable to applications in information fusion, information retrieval, data mining, geographic information systems, medical information systems and plant genome information gathering: a) any document, including web page, database record, video file, audio file, image and even paper documents, can be enhanced by an adlet and become an active document; b) any node in a nonactive network can be enhanced by adlet-savvy software and the adlet-enhanced node can co-exist with other non-enhanced nodes. An experimental prototype provides a testbed for feasibility studies in a hybrid active network environment.

Introduction and Motivation
The Internet has become an indispensable means of communication and collaboration among people at different levels and in various capacities. It provides access to a bewilderingly large number and variety of resources, including text, audio and video files, scientific data, retail products, network services, and transcripts of conversations. Because of the scale and decentralized nature of this environment, the Internet has evolved into a chaotic repository of all types of information, making it difficult to locate resources of interest. If the Internet is to continue to grow and thrive as a new means of communication, new tools are needed to organize and support its resource discovery in a fashion that keeps pace with its exponential growth in size and diversity.
Over the past several years, a number of Internet information discovery tools and search engines have been introduced, including Lycos, Alta Vista, Infoseek, Yahoo and Harvest [AltaVista, Bowman94, Hardy95]. These automated tools bear most of the responsibility for organizing information on the Internet by automatically classifying and indexing collections of digital data. They periodically dispatch indexing engines, frequently referred to as "Web Crawlers", to download and examine documents in order to extract information that describes these documents [Cheong96]. The extracted information is stored in the search engine's database, along with the uniform resource locator of the site where the document resides. The stored information is later used to produce a list of resources in response to requests submitted by Internet users to query the search engine's database [Demsey96, Hardy95, Lagoze96].
These tools have become quite popular and are helping to redefine how people think about wide area network applications. In theory they can address the fundamental problem of resource indexing and discovery on the Internet. Consequently they have the potential to address the inability of human users to cope with the Internet's rapid growth and enormous data volume. Yet it has become clear that they are not well suited to supporting the ever evolving information infrastructure, characterized by enormous data volume, rapid growth in the user base, and burgeoning data diversity. This is mostly due to the fact that these tools are not capable of identifying basic characteristics of a document, such as its theme or genre, let alone its underlying meaning or context. They can only provide uniform and equal access to all resources in the Internet, for they cannot reliably extract routine information that a human indexer can find through a simple cursory inspection. As a result, queries issued by Internet users often produce an overwhelming number of responses, which frequently contain references to irrelevant material while leaving out more relevant ones.
This paper addresses some of the problems involved with resource discovery in the Internet. Our approach is based on the concept of active documents advertising, whereby a document dynamically builds metadata, joins a group of documents of the same interest and advertises the collected metadata to the members of the group. This abstraction of metadata is called an adlet, which is the core of our approach.
The characterization of metadata may range from a title or type, to key features of the document or of the author. The metadata is derived from a) the user-specified detailed annotations of the document, b) the advertisement strategy and c) the recruiting strategy. The annotations provide an in-depth characterization of a document and are used as the basis for indexing, document retrieval and information fusion. The advertisement strategy defines the policies used by the active document in advertising itself to the rest of the Internet. The recruiting strategy determines the type of documents allowed to become associated and/or fused with this active document in its evolution toward an enhanced and richer document.
The paper is organized as follows. Section 2 presents the system architecture. The basic definitions for adlets are introduced in Section 3. The visual specification of adlets by annotation, the visualization of adlets and real/virtual sites, and the merging of adlets are discussed in Section 4. Section 5 describes our approach to adlet group management and negotiation. The construction and updating of the virtual graph, which together with the adlets forms the knowledge base, is described in Section 6. An experimental prototype based upon the active index system is being built, as described in Section 7. In Section 8 we compare our approach with ongoing research. Concluding remarks about further research are given in Section 9.


System Architecture
The system architecture supporting active document advertising should address the issue of how to organize the resource space flexibly and dynamically. Rather than imposing a hierarchical structure, our approach allows the active document structure to evolve in accordance with usage patterns, based on the advertising and recruiting strategies specified by the user. More specifically the active documents dynamically organize and search the resource space by constructing links among themselves based on the metadata that describe their contents, types, context, the advertising and recruiting strategies specified by the user, and the semantics of the type of active document being sought. The links form a virtual graph, with a flexible set of hierarchies embedded within the graph to provide efficient searching. The virtual graph's edges are associated with weights that reflect the likelihood that the adlet attached at the other end is interested in recruiting this document or is itself representing a document that can be of interest to the current active document. The virtual graph's structure evolves over time through the use of virtual graph updating algorithms in accordance with user specified metadata. This approach provides the basis for the development of a scalable, customizable architecture for gathering, indexing, caching, replicating, accessing and fusing Internet information.
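The weighted virtual graph described above can be illustrated with a minimal Python sketch. It is not the updating algorithm of the paper; the class name `VirtualGraph`, the string adlet identifiers, and the convention that weights lie in [0, 1] are assumptions made only for illustration.

```python
from collections import defaultdict


class VirtualGraph:
    """Weighted directed links between adlets (hypothetical string ids)."""

    def __init__(self):
        # edges[src][dst] = weight; the [0, 1] range is an assumption.
        self.edges = defaultdict(dict)

    def link(self, src, dst, weight):
        """Create or update the edge src -> dst."""
        if not 0.0 <= weight <= 1.0:
            raise ValueError("weight must lie in [0, 1]")
        self.edges[src][dst] = weight

    def best_neighbors(self, src, k=3):
        """Return up to k neighbors of src, highest weight first."""
        ranked = sorted(self.edges.get(src, {}).items(),
                        key=lambda item: (-item[1], item[0]))
        return [dst for dst, _ in ranked[:k]]


g = VirtualGraph()
g.link("ad-mars", "ad-rover", 0.9)
g.link("ad-mars", "ad-finance", 0.1)
g.link("ad-mars", "ad-geology", 0.6)
print(g.best_neighbors("ad-mars", k=2))  # ['ad-rover', 'ad-geology']
```

Re-calling `link` with a new weight stands in for the graph evolving over time; how the weights are actually recomputed is left open here.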

Adlets use the virtual graph to locate an adequate number of other adlets and their corresponding documents situated in their sites. Active index cells [Chang95] are associated with adlets representing abstracted active documents, and through these index cells advertisements can be sent and received in accordance with negotiation protocols.
The active index system [Chang96b] collects indexing information received from other adlets, suppresses duplicate information and filters out undesired information using the user-specific profile. The active index system then summarizes adlet information in various type-specific ways to generate structured indexing information that can be used by the Adlet Visualizer and search entities [Abit97]. The active index is designed to support multiple "views" in a way such that the gathered information about adlets can be easily extracted and correlated and integrated with local adlet information and information received from remote sites [Kim95, Sheth90]. The fusion of information will reflect the profile of the associated user and will facilitate the visualization of different views of the related active documents.
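The duplicate suppression and profile-based filtering performed by the active index system might be sketched as follows. This is an illustrative simplification, not the active index implementation: the function name `filter_advertisements`, the representation of an advertisement as a `(doc_id, profile)` pair, and the `unwanted` concept set standing in for the user-specific profile are all assumptions.

```python
def filter_advertisements(ads, unwanted):
    """Drop duplicate ads and ads whose profile touches an unwanted concept.

    ads:      iterable of (doc_id, profile) pairs, profile being a set of concepts
    unwanted: set of concepts excluded by the (hypothetical) user profile
    """
    seen = set()
    kept = []
    for doc_id, profile in ads:
        key = (doc_id, frozenset(profile))
        if key in seen:
            continue  # duplicate information is suppressed
        seen.add(key)
        if set(profile) & set(unwanted):
            continue  # undesired information is filtered out
        kept.append((doc_id, profile))
    return kept


incoming = [
    ("doc-1", {"mars"}),
    ("doc-1", {"mars"}),            # duplicate of the first ad
    ("doc-2", {"finance"}),         # matches an unwanted concept
    ("doc-3", {"mars", "geology"}),
]
print([doc for doc, _ in filter_advertisements(incoming, {"finance"})])
# ['doc-1', 'doc-3']
```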


Our approach can be summarized in plain language as follows: Adlet tells you what to look for and what to avoid, virtual graph tells you where to go, and negotiation protocol tells you when to go and how to go. Figure 1 shows the system architecture. The Adlet Visualizer supports visual specification and visualization of adlets. The Adlet Manager supports creation and composition of adlets. The Adlet Negotiator handles negotiation protocols and communications with the network. The Active Index System maintains the virtual graph and handles all messages to/from adlets and active documents.


Adlets
In this paper we concern ourselves mainly with the advertising, negotiation and exchange of active documents. The documents are active in the sense that each document can perform actions with autonomy. An adlet is the means by which an active document makes itself known to other active documents. An active document itself may post an adlet, or a user may post an adlet for an active document. An adlet has a target, i.e., it is meant only for active documents capable of negotiating and exchanging certain classes of documents. Usually the target can be defined based upon the notion of document hierarchy, i.e., documents satisfying certain conceptual relations are regarded as the target. Finally documents to be avoided constitute the non-target for this adlet.
The semantic aspects of the advertisement associated with an adlet must be defined in a controlled and uniform manner. The user/application must be able to request a style of advertisement, i.e., a certain advertising strategy.
Upon its creation, an adlet attempts to join a group of adlets, possibly creating a new one if such a group does not yet exist. Other adlets may join the group, leave the group, advertise to the group and receive advertisement from other members of the group.
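The group operations just listed (join, possibly creating the group, leave, advertise and receive) can be sketched in a few lines of Python. The class `GroupRegistry`, the use of a single in-memory registry, and the idea of keying groups by an interest string are assumptions for illustration; the paper's group management is described in Section 5.

```python
class GroupRegistry:
    """In-memory stand-in for adlet group membership (hypothetical)."""

    def __init__(self):
        self.groups = {}  # interest -> set of adlet ids
        self.inbox = {}   # adlet id -> list of (sender, message)

    def join(self, adlet_id, interest):
        """Join the group for `interest`; return True if it had to be created."""
        created = interest not in self.groups
        self.groups.setdefault(interest, set()).add(adlet_id)
        self.inbox.setdefault(adlet_id, [])
        return created

    def leave(self, adlet_id, interest):
        """Leave the group; a no-op if the adlet is not a member."""
        self.groups.get(interest, set()).discard(adlet_id)

    def advertise(self, sender, interest, message):
        """Deliver `message` to every other member; return the recipients."""
        members = self.groups.get(interest, set())
        if sender not in members:
            raise ValueError("sender is not a member of this group")
        recipients = sorted(members - {sender})
        for member in recipients:
            self.inbox[member].append((sender, message))
        return recipients


registry = GroupRegistry()
registry.join("ad-1", "mars")   # creates the group
registry.join("ad-2", "mars")   # joins the existing group
registry.advertise("ad-1", "mars", "new imagery available")
print(registry.inbox["ad-2"])   # [('ad-1', 'new imagery available')]
```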

Definition 1: An adlet is defined as adlet = (doc, profile, target, non-target, ad-strategy, prop) where doc uniquely identifies the document to be advertised, profile is a set of conceptual relations from a concept space characterizing this document, target is a set of conceptual relations characterizing documents to be recruited, non-target is a set of conceptual relations characterizing documents to be avoided, ad-strategy is the advertising strategy, and prop are other properties (see Section 4.1).
The profile is specified, in the simplest case, by keywords associated with the document. In general the profile is a set of conceptual relations, or a conceptual graph, derived from the user's annotations. (We regard a concept as a special case of a conceptual relation with a null second part.) This information is used by the adlet to develop the virtual graph and maintain it as it evolves dynamically.
The target specifies the scope and the focus of the advertisement. The scope of an advertisement may be global, restricted or local. Within a target the focus of the advertisement may be all adlets or a selected subset of adlets. The non-target specifies what is out of the scope.
The ad-strategy determines the advertising strategy used for document advertisement. The strategy could be aggressive or reactive. An aggressive strategy initiates advertisement immediately after its creation and continues advertising in a periodic fashion. A reactive strategy on the other hand engages in advertisement only in response to advertisement received from other active documents.
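The difference between the two strategies can be made concrete with a small decision function. This is only a sketch: the discrete notion of time (`ticks_since_creation`), the `period` parameter, and the function name `should_advertise` are assumptions and do not appear in the definition.

```python
def should_advertise(strategy, ticks_since_creation, period, ad_received):
    """Decide whether an adlet advertises at the current tick.

    aggressive: advertises at creation (tick 0) and then every `period` ticks
    reactive:   advertises only in response to a received advertisement
    """
    if period <= 0:
        raise ValueError("period must be positive")
    if strategy == "aggressive":
        return ticks_since_creation % period == 0
    if strategy == "reactive":
        return bool(ad_received)
    raise ValueError("unknown strategy: %r" % (strategy,))


print(should_advertise("aggressive", 0, 5, ad_received=False))  # True
print(should_advertise("reactive", 0, 5, ad_received=False))    # False
```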

Example 1: A document can be a WWW page and therefore doc = URL. The document is characterized by a set of keywords and therefore profile = {keyword1, ...., keywordn}. This WWW page is in the class of browsable documents and the intended target is also a selected collection of browsable documents, i.e., target = {'Browser'}, where 'Browser' characterizes a document class in the class hierarchy. The non-target is unspecified and therefore the default is the empty set. The advertising strategy is reactive.

Example 2: A document can be a database record. The record_id is its doc. The document is characterized by a conceptual schema describing the conceptual structure in which this record instance is situated, and therefore its profile = { schema_name }. The target is { 'financial record' }, denoting the collection of financial records of an enterprise. The non-target is { 'scientific' }. The advertising strategy is aggressive.

Example 3: An active document may post an adlet by itself. In this case, ad = (doc, profile, {'all'}, {'none'}, ad-strategy, prop). In other words, this adlet is posted by one active document for 'all', with 'none' to be avoided. Such an adlet needs to be refined so that it can aim at more specific targets.
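Definition 1 and Examples 1 and 2 can be written down as a small data structure. The sketch below is one possible encoding, not a prescribed one: the Python names, the placeholder URL, the keywords, the record identifier, and the representation of `prop` as a tuple are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class AdStrategy(Enum):
    AGGRESSIVE = "aggressive"
    REACTIVE = "reactive"


@dataclass(frozen=True)
class Adlet:
    """adlet = (doc, profile, target, non-target, ad-strategy, prop)."""
    doc: str                               # uniquely identifies the document
    profile: frozenset                     # concepts characterizing the document
    target: frozenset                      # documents to be recruited
    non_target: frozenset = frozenset()    # documents to be avoided (default: empty)
    ad_strategy: AdStrategy = AdStrategy.REACTIVE
    prop: tuple = ()                       # other properties, left open here


# Example 1: a WWW page (URL and keywords are placeholders).
web_page = Adlet(
    doc="http://example.org/page.html",
    profile=frozenset({"keyword1", "keyword2"}),
    target=frozenset({"Browser"}),
)

# Example 2: a database record (record id is a placeholder).
record = Adlet(
    doc="record-42",
    profile=frozenset({"schema_name"}),
    target=frozenset({"financial record"}),
    non_target=frozenset({"scientific"}),
    ad_strategy=AdStrategy.AGGRESSIVE,
)

print(web_page.non_target == frozenset())  # True
print(record.ad_strategy.value)            # aggressive
```

Making the dataclass frozen reflects the view that a refined adlet is a new adlet rather than a mutation of an existing one; this is a design choice of the sketch.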

Definition 2: Two adlets ad1 and ad2 can be partially ordered as follows, ad1 < ad2 if and only if the following holds: a) doc1 = doc2, b) profile1 = profile2, c) target1 is contained in target2, d) non-target1 contains non-target2, e) ad-strategy1 = ad-strategy2, and f) prop1 = prop2.
In other words ad1 is intended for a smaller target group, so ad1 is more refined than ad2.
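The ordering of Definition 2 reduces to equality checks plus two set comparisons. The following sketch assumes that "contained in" and "contains" mean non-strict set inclusion, so every adlet refines itself; the tuple type, the function name `refines`, and the sample values are hypothetical.

```python
from typing import FrozenSet, NamedTuple


class Adlet(NamedTuple):
    doc: str
    profile: FrozenSet[str]
    target: FrozenSet[str]
    non_target: FrozenSet[str]
    ad_strategy: str
    prop: FrozenSet[str]


def refines(ad1, ad2):
    """True if ad1 is at least as refined as ad2 under conditions a) to f)."""
    return (
        ad1.doc == ad2.doc                        # a)
        and ad1.profile == ad2.profile            # b)
        and ad1.target <= ad2.target              # c) target1 contained in target2
        and ad1.non_target >= ad2.non_target      # d) non-target1 contains non-target2
        and ad1.ad_strategy == ad2.ad_strategy    # e)
        and ad1.prop == ad2.prop                  # f)
    )


broad = Adlet("doc-1", frozenset({"mars"}), frozenset({"Browser", "Database"}),
              frozenset(), "reactive", frozenset())
narrow = broad._replace(target=frozenset({"Browser"}),
                        non_target=frozenset({"scientific"}))

print(refines(narrow, broad))  # True
print(refines(broad, narrow))  # False
```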
Definition 3: The negotiation protocol is a sequence e-proc = ((d11,d12), (d21,d22), ..., (dn1,dn2)), where each (di1,di2) is an exchanged pair of documents and the final pair (dn1,dn2) is a pair of goal documents.

Example 4: Let the first entry be the consumer's and the second the producer's, then an information exchange negotiation protocol typically may look like: (('Mars',-), (-,'Heaven'), ('Sojourner', -), (-, doc-13), ('Thanks', -)).
Therefore, the adlets should induce a sequence of information exchanges e-seq between the consumer and the producer, such that e-seq contains e-proc as a sub-sequence. If this is feasible, then we say the negotiation protocol is supportable.

Definition 4: A negotiation protocol is supportable if the adlets can induce a sequence of information exchanges e-seq such that e-seq contains e-proc as a subsequence and the final pair (dn1,dn2) of e-seq is a pair of goal documents.
Additional constraints may be imposed on the supportable negotiation protocol, such as the progressive refinement of the adlets, leading to a more precise search criterion and consequently a smaller set of matching documents.
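Definition 4 amounts to a subsequence test plus a check on the final pair. The sketch below uses the exchange of Example 4, with `None` standing for the "-" entries; the goal predicate `is_goal_pair`, the extra exchange `(None, "doc-7")`, and the function names are assumptions for illustration.

```python
def contains_subsequence(e_seq, e_proc):
    """True if e_proc occurs in e_seq in order, not necessarily contiguously."""
    remaining = iter(e_seq)
    # `step in remaining` consumes the iterator up to the first match,
    # so later steps can only match later exchanges.
    return all(step in remaining for step in e_proc)


def is_supportable(e_seq, e_proc, is_goal_pair):
    """True if e_seq contains e_proc and ends in a pair of goal documents."""
    return (
        len(e_seq) > 0
        and contains_subsequence(e_seq, e_proc)
        and is_goal_pair(e_seq[-1])
    )


# Example 4: (consumer entry, producer entry).
e_proc = [("Mars", None), (None, "Heaven"), ("Sojourner", None),
          (None, "doc-13"), ("Thanks", None)]

# A hypothetical induced sequence with one additional exchange.
e_seq = [("Mars", None), (None, "Heaven"), (None, "doc-7"),
         ("Sojourner", None), (None, "doc-13"), ("Thanks", None)]


def is_goal_pair(pair):
    # Assumed goal test: the exchange ends with the consumer's acknowledgement.
    return pair == ("Thanks", None)


print(is_supportable(e_seq, e_proc, is_goal_pair))        # True
print(is_supportable(e_seq[::-1], e_proc, is_goal_pair))  # False
```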


A Scenario
Adlets can be generated by active documents to interact with other adlets or active documents to accomplish the objectives of an advertisement plan. In this section we describe a scenario of active documents advertising using adlets.


Visual Specification of Adlets by Annotation
When a document type is defined, the user can visually specify the types and properties of its adlets. The design of this visual specification language is based upon the theory of visual languages. The mode of interaction is direct manipulation. The user annotates the objects on the screen by pointing, clicking and entering keywords.
The user first defines a concept space where the concepts are color-coded. Each adlet type has a dominant color which is the concept that characterizes it most closely, and other less dominant colors corresponding to the concepts that characterize it with varying degree of closeness. The darkness or lightness of a color indicates the closeness of this characterization. Each adlet type also has a dominant color which is the concept that attracts it most, and other less dominant colors for less attractive concepts. Finally, concepts that repulse the adlet are also defined. These color codes are shown in different areas on the adlet or dynamically shown in an alternating fashion.
Figure 2 illustrates the visual specification of an adlet which carries the following information: "I represent doc, I am profile, I love target, I hate non-target, I stay put (reactive) or travel (aggressive) and I behave like prop". 

The attraction/repulsion property of an adlet is important in determining the motion behavior of an adlet. An attraction/repulsion force of an adlet to a color-coded concept can be one of the following, as specified by the user (perhaps by clicking on a sliding scale):

A1. Extreme attraction: the adlet must merge with an adlet/site of said color.
A2. Strong attraction: the adlet attempts to merge with an adlet/site of said color until precondition becomes 'false'.
A3. Weak attraction: the adlet makes a best effort attempt to merge with an adlet/site of said color.
A4. Neutral: no attraction/repulsion.
A5. Weak repulsion: the adlet makes a best effort attempt to avoid an adlet/site of said color.
A6. Strong repulsion: the adlet attempts to avoid an adlet/site of said color until precondition becomes 'false'.
A7. Extreme repulsion: the adlet avoids the adlet/site with said color, even if this means its own destruction.
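The seven force levels A1 to A7 map naturally onto a signed scale. The sketch below is one possible encoding: the integer values, the names `Force`, `effective_force` and `intent`, and the choice to treat a strong force as neutral once its precondition is false are assumptions, since the list above only states that the attempt lasts until the precondition becomes 'false'.

```python
from enum import IntEnum


class Force(IntEnum):
    EXTREME_REPULSION = -3   # A7
    STRONG_REPULSION = -2    # A6
    WEAK_REPULSION = -1      # A5
    NEUTRAL = 0              # A4
    WEAK_ATTRACTION = 1      # A3
    STRONG_ATTRACTION = 2    # A2
    EXTREME_ATTRACTION = 3   # A1


def effective_force(force, precondition=True):
    """Strong forces (A2, A6) apply only while the precondition holds."""
    if abs(force) == 2 and not precondition:
        return Force.NEUTRAL  # assumed behavior once the precondition fails
    return force


def intent(force, precondition=True):
    """Return 'merge', 'avoid' or 'ignore' for an adlet/site of a given color."""
    current = effective_force(force, precondition)
    if current > 0:
        return "merge"
    if current < 0:
        return "avoid"
    return "ignore"


print(intent(Force.STRONG_ATTRACTION))                      # merge
print(intent(Force.STRONG_ATTRACTION, precondition=False))  # ignore
print(intent(Force.EXTREME_REPULSION, precondition=False))  # avoid
```

A sliding scale in the user interface would then select one of the seven `Force` values per color-coded concept.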

The ad-strategy of an adlet type can be specified by clicking on either "reactive" or "aggressive". Other properties of an adlet type to be specified by the user include the following:

B. Destructibility: Adlet can be destructive or nondestructive. A destructive adlet can be destroyed when a condition becomes 'true'. A nondestructive adlet cannot be destroyed under ordinary circumstances.

C. Regenerability: Adlet can be regenerating or nonregenerating. Regenerating adlet is able to regenerate instances of its identical type. Nonregenerating adlet cannot regenerate.

D. Migration: Adlet can be migrating or nonmigrating. A migrating adlet will seek out a site that attracts it and migrate to that site. A nonmigrating adlet will not migrate by itself. (An aggressive ad-strategy leads to a migrating adlet and a reactive ad-strategy to a nonmigrating adlet.)

E. Temporal Sensitivity: Adlet can be temporally sensitive or nontemporally sensitive, depending on whether its predicates and actions involve time as parameters.

F. Location Sensitivity: Adlet can be location sensitive or nonlocation sensitive, depending on whether its predicates and actions involve location as parameters.

G. Context Sensitivity: Adlet can be context sensitive or noncontext sensitive, depending on whether its predicates and actions involve contextual variables as parameters.
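Properties B through G can be collected into a single record of boolean flags, which is one way to fill the prop component of an adlet. The sketch below is illustrative only: the field names, the defaults, and the helper `properties_for`, which derives the migration flag from the ad-strategy as noted under D, are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AdletProperties:
    destructive: bool = False           # B
    regenerating: bool = False          # C
    migrating: bool = False             # D
    temporally_sensitive: bool = False  # E
    location_sensitive: bool = False    # F
    context_sensitive: bool = False     # G


def properties_for(ad_strategy, **flags):
    """Build the property record, deriving `migrating` from the ad-strategy.

    Passing `migrating` explicitly raises TypeError, because the flag is
    determined by the strategy in this sketch.
    """
    if ad_strategy not in ("aggressive", "reactive"):
        raise ValueError("unknown ad-strategy: %r" % (ad_strategy,))
    return AdletProperties(migrating=(ad_strategy == "aggressive"), **flags)


props = properties_for("aggressive", temporally_sensitive=True)
print(props.migrating)             # True
print(props.temporally_sensitive)  # True
```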

After an initial session to perform detailed annotation, the user creates an adlet type which can be reused in the future when a similar document type is encountered.
