INTRODUCTION TO HETEROGENEOUS DATABASES
Contents
1. Introduction
2. Basic Definitions
3. Classification Of Heterogeneous Databases
4. Problems In Integration Of Heterogeneous Databases
5. Federated Multidatabase Systems
5.1 Requirements for federated database management system
5.2 Characteristics of a federated database environment
5.3 Architecture of a federated system
5.4 Some Interesting Topics
6. Data Sharing
7. Object-Oriented Multidatabase Systems
8. Existing Heterogeneous Systems
8.1 Non-object oriented federated databases
8.2 Object oriented databases
9. Current Issues
1. Introduction
The integration of distributed database systems poses many challenges due to
differences in the data management systems (e.g., different vendors), in the data models
(e.g., relational, text indexing), in the query and transaction processing algorithms, in
the data types (e.g., text, graphics, multimedia, hypermedia, sensor, knowledge bases), in
the format (i.e., structured, unstructured), and in the semantics.
Within the intelligence community, a great many heterogeneous legacy
intelligence databases exist. For example, an intelligence analyst may have to look for
daily incoming messages in Verity TOPIC, specific domain data in DB2, archival messages
in an older proprietary database, and a group discussion of a related topic in Lotus Notes.
Ultimately, the goal is for the analyst to utilize each of these data sources to obtain useful
information in a timely and simple manner.
2. Basic Definitions
A distinguishing property of a distributed database is that it can be homogeneous
or heterogeneous. A homogeneous distributed database is one where all the local
databases are managed by the same DBMS. This approach is the simplest one and
provides incremental growth, which makes the addition of a new site in the network easy,
and increased performance, by exploiting the parallel processing capability of multiple
sites.
A heterogeneous distributed database is one where the local databases need not be
managed by the same DBMS. For example, one DBMS can be a relational system while
another can be a hierarchical system. This approach is far more complex than the
homogeneous one but enables the integration of existing independent databases without
requiring the creation of a completely new distributed database. In addition to the main
functions, the distributed DBMS must provide interfaces between the different DBMSs.
A federated database is a combination of autonomous, heterogeneous databases
that operate together.
The software that co-ordinates all these different databases is called a federated
database management system (FDBMS); its form is shown in figure 1. The basic
characteristic of a federated system is the co-ordination among independent database
systems.
Every database that takes part in the federation is called a component database.
One component database can take part in more than one federation, and its database
manager can be centralized or distributed. One important characteristic of a federated
management system is that every component database can continue executing its local
tasks while it is participating in the federation.
Federated database management systems are a subset of a more general
category of database systems, the multidatabase systems.
Finally, a system that allows periodic data transfer among the different databases
is called a data exchange system.
3. Classification of heterogeneous databases
Approaches to managing heterogeneous databases include linking heterogeneous
databases via the World Wide Web (WWW), organizing them into database federations or
multidatabase systems, and constructing data warehouses. Common to these approaches
is allowing component databases to preserve their autonomy, that is, their local definitions,
applications, and policy of exchanging data with other databases [Bright et al. 1992].
Heterogeneous database systems have been traditionally classified by the type of
schemas, by their extent of data sharing, and by the data access facilities they support.
Schemas supported by a heterogeneous database system include:
- local views representing the schemas of component databases, expressed in
the Data Definition Language (DDL) of the local databases; and
- a global schema expressed in a common DDL, providing a unified view of the
schemas of all component databases.
Data sharing in a heterogeneous database system can be at the level of:
- linking specific data items in the component databases; or
- generic (schema driven) correlations across component databases.
Individual data item links (e.g. hypertext links) between databases do not require
or comply with schema correlations across databases. For schema correlations, data links
need to be consistent with the constraints entailed by these correlations, such as inter-
database referential integrity constraints.
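As a minimal sketch of what checking an inter-database referential integrity constraint
could involve, the following Python fragment uses two in-memory SQLite databases as
stand-ins for component databases. The tables, columns and the function
`dangling_references` are hypothetical and are not taken from any system discussed here.

```python
import sqlite3

def make_db(ddl, insert_sql, rows):
    """Create an in-memory SQLite database standing in for one component database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(ddl)
    conn.executemany(insert_sql, rows)
    return conn

# Hypothetical component databases: one holds customers, the other holds orders.
customers_db = make_db(
    "CREATE TABLE customer (cust_id TEXT PRIMARY KEY, name TEXT)",
    "INSERT INTO customer VALUES (?, ?)",
    [("C1", "Ann"), ("C2", "Bob")],
)
orders_db = make_db(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, cust_id TEXT)",
    "INSERT INTO orders VALUES (?, ?)",
    [(1, "C1"), (2, "C9")],
)

def dangling_references(referencing, referenced):
    """Return the ids of orders whose cust_id has no match in the other database."""
    known = {row[0] for row in referenced.execute("SELECT cust_id FROM customer")}
    query = "SELECT order_id, cust_id FROM orders ORDER BY order_id"
    return [oid for oid, cust in referencing.execute(query) if cust not in known]

violations = dangling_references(orders_db, customers_db)
```

Here order 2 refers to a customer that the other database does not know, so it is
reported as a violation. A plain hypertext link between two data items would not be
subject to such a check.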
Data access facilities in a heterogeneous database system can range from :
- browsing across component databases; to
- querying a centralized data warehouse; to
- querying multiple databases.
Browsing across component databases is usually based on traversing WWW
hyperlinks from data items in one database to data items in another database, and does
not require schema correlations. Querying a data warehouse amounts to querying a single
database, where the data of all component databases are represented according to the
global schema of the warehouse. Querying multiple databases is carried out by expressing
queries over the global schema of the heterogeneous database system or over the
component database local views, and translating them into queries for the component
databases. Alternatively, a heterogeneous database system can be provided with a
multidatabase query language that allows expressing queries that refer directly to
elements of component databases.
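The idea of answering a query over a global schema through queries on the component
databases can be sketched as follows. This is only an illustration under simple
assumptions: the global relation employee(name, salary), the two local schemas and the
`MAPPINGS` table are all hypothetical, and SQLite merely stands in for the component
DBMSs.

```python
import sqlite3

# Two hypothetical component databases with different local schemas.
site_a = sqlite3.connect(":memory:")
site_a.execute("CREATE TABLE staff (name TEXT, salary REAL)")
site_a.executemany("INSERT INTO staff VALUES (?, ?)", [("Ann", 50.0), ("Bob", 40.0)])

site_b = sqlite3.connect(":memory:")
site_b.execute("CREATE TABLE personnel (full_name TEXT, pay REAL)")
site_b.executemany("INSERT INTO personnel VALUES (?, ?)", [("Cy", 60.0)])

# Mapping from the global relation employee(name, salary) to each local schema:
# one pre-translated local query per component database.
MAPPINGS = [
    (site_a, "SELECT name, salary FROM staff WHERE salary >= ?"),
    (site_b, "SELECT full_name, pay FROM personnel WHERE pay >= ?"),
]

def employees_earning_at_least(threshold):
    """Answer a global query by running one translated query per component."""
    result = []
    for conn, local_sql in MAPPINGS:
        result.extend(conn.execute(local_sql, (threshold,)).fetchall())
    return sorted(result)

answer = employees_earning_at_least(45.0)
```

The caller works only with the global notion of an employee; which tables and column
names exist at each site is hidden in the mapping.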
In addition to all the above, multidatabase systems can be classified according
to the following criteria: distribution, heterogeneity, and autonomy [Sheth and Carson, 1990].
Distribution
The data can be distributed across many databases, which can reside in the same
or in different computer systems. The advantage of distributing the data is that
they can be accessed more easily. In the case of federated databases, the reason for the
distribution is that the databases existed before their federation was formed.
Heterogeneity
Many types of heterogeneity are due to technological differences, such as
differences in the hardware, in the software, or in the operating system. Many
software engineers have dealt with this problem, and as a result there are now many
commercial database management systems that work well under such
heterogeneous conditions. Generally, we can divide the heterogeneities into two
sub-categories: those that are due to differences among the various database
management systems involved in a heterogeneous database, and those that are due to
differences in the semantics of the data.
The first sub-category covers heterogeneous databases whose component
databases differ in the structures used, in the restrictions on the domains,
or in the query languages that each one uses.
The second sub-category covers heterogeneous databases whose component
databases differ in the meaning they attach to the data, or in the
interpretation or use of the same or related data. Even detecting these kinds of
heterogeneities is a difficult problem [Seltor et al., 1993].
Finally, we should point out that one important problem in the co-operation of
heterogeneous databases is finding and resolving the heterogeneities among
them.
Autonomy
A component database that is part of a federated database system can have
several kinds of autonomy.
1. Design autonomy
This is the ability of a component database to choose its own design
(in other words, to choose its own data, its own way of representing these
data, and the domains of its attributes).
2. Communication autonomy
This is the ability of a component database to decide on its own when it
“wants” to communicate with other component databases, or with the
federated database system.
3. Execution autonomy
This is the ability of a component database to execute locally some operations,
without being influenced by the external operations that are executed in other
component databases or in the federated database system. This implies that a
certain component database can reject any operation that is against its local
constraints.
4. Association autonomy
This is the ability of a component database to decide by itself how much and
for how long it will share its resources with other component databases or
with the federated database system. This autonomy also includes the
ability of the component database to associate itself with the federation or
not, and the ability to participate in one or more federations.
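Two of the kinds of autonomy listed above, execution autonomy and association
autonomy, can be illustrated with a small Python sketch. The class
`ComponentDatabase`, its single local constraint `max_value` and the federation names
are hypothetical and serve only to make the definitions concrete.

```python
class ComponentDatabase:
    """Toy component database illustrating execution and association autonomy."""

    def __init__(self, name, max_value):
        self.name = name
        self.max_value = max_value      # local constraint, chosen locally
        self.rows = []
        self.federations = set()        # association autonomy: may be several

    def join(self, federation):
        """Decide to take part in a federation."""
        self.federations.add(federation)

    def leave(self, federation):
        """Decide to stop sharing with a federation."""
        self.federations.discard(federation)

    def insert(self, value, requested_by=None):
        """Apply an insert, or reject it when it violates a local rule."""
        if requested_by is not None and requested_by not in self.federations:
            return False                # not associated with that federation
        if value > self.max_value:
            return False                # execution autonomy: local constraints win
        self.rows.append(value)
        return True

db = ComponentDatabase("payroll", max_value=100)
db.join("fed1")
results = [
    db.insert(10),              # local operation
    db.insert(500, "fed1"),     # federation request violating a local constraint
    db.insert(20, "fed1"),      # acceptable federation request
    db.insert(30, "fed2"),      # request from a federation the database never joined
]
```

Local operations keep working regardless of the federation, and an external request is
applied only if the database is associated with the requester and its own constraints
allow it.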
Multidatabase systems can be classified into two categories according to the
association autonomy of their component databases: non-federated database systems and
federated database systems (see also Sections 1 and 5).
A non-federated database system is a collection of database systems that are not
autonomous. Such a system has just one level of management, and all operations are
applied to all the component databases. In such a system there is no distinction between
local and non-local users, or even operations. If all the database components are fully
associated, the non-federated system is called a unified multidatabase system.
A federated database system consists of component databases that, although
autonomous, participate in the federation and allow partial and controlled
sharing of their data. Federated database systems lie between no co-ordination
and full co-ordination among heterogeneous databases. They
can be further classified into loosely coupled federated database systems
and tightly coupled federated database systems, according to who manages the
federation and how the component databases are co-ordinated.
In a loosely coupled federated database system it is the responsibility of the user
to create and manage the federation, and there is no control from the federated database
system or its managers. These kinds of federated databases are also called interoperable
database systems. A loosely coupled federated database system supports multiple federated
schemas. That happens because when we design a federated database system, we
actually also design one or more federated schemas according to the operations that are
applied by the users or the application programs.
In a tightly coupled federated database system, the federated system and its
managers have the responsibility of creating and managing the federation. They
also have the responsibility of controlling the component databases. These kinds of federated
database systems can have one or more federated schemas. If a federated database has
just one schema we say that it is a single federation; if it has multiple schemas we say that
it supports multiple federations.
In figure 2 the classification of the multidatabase systems described above is
shown, as well as an indicative example for every category.
An analytical classification of the heterogeneous databases according to the three
criteria mentioned above can be found in [Ozsu and Valduriez, 1991].
Finally in [Bright et al. 1992] there is another classification of heterogeneous
systems. This one classifies the systems from the most tightly coupled ones to the most
loosely coupled ones. According to this classification we have : distributed systems, global
schema multidatabase systems, federated systems, multidatabase language systems,
homogeneous multidatabase language systems, and interoperable systems.
4. Problems in integration of heterogeneous databases
The heterogeneities in the schemas and in the semantics of the component databases
are a serious problem in the design and the use of multidatabase systems. Here we
concentrate on some of the taxonomies, based on the nature of the heterogeneities, that
are found in the literature. These heterogeneities cause many problems in the integration of
heterogeneous databases.
In [Parent et al. 1992] and in [Reddy et al. 1994] there is a taxonomy of different
kinds of heterogeneities found in the schemas of the heterogeneous databases.
- The first kind is semantic heterogeneities. This category includes the
heterogeneities caused by the fact that different database designers have
different ways of understanding the same objects.
- The second kind is description heterogeneities. Here different database
designers describe the same object using different sets of characteristics.
- The third kind is model heterogeneities. Here we have all the heterogeneities
that are caused by the fact that different database designers use different
models to represent the same data.
- Finally we have the heterogeneities where although the same model is used to
represent the same data the structures used for this representation are different.
These heterogeneities are called structure heterogeneities.
In [Kim and Seo, 1991] there is a complete taxonomy of the types of
heterogeneities found in multidatabase systems that are based entirely on the
relational model. These heterogeneities are divided into schema heterogeneities and data
heterogeneities. Schema heterogeneities arise because (1) different
structures (tables, attributes) are used to represent the same information and
(2) for the same structures, different database systems may use different “design rules”.
The schema heterogeneities are divided into table-versus-table conflicts,
attribute-versus-attribute conflicts and table-versus-attribute conflicts. The data
heterogeneities describe the case where the various component databases of a
heterogeneous database hold wrong data or different representations of the same data.
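To make the distinction concrete, the sketch below resolves one attribute-versus-attribute
conflict (different names for the same attribute) and one data heterogeneity (different
units for the same quantity) between two hypothetical relational sources. The rows, the
rule tables `RULES_A` and `RULES_B`, and the function `homogenize` are illustrative
assumptions and do not reproduce the taxonomy of the cited papers.

```python
# Hypothetical rows from two component databases describing the same kind of product.
source_a = [{"prod_name": "bolt", "weight_kg": 0.5}]
source_b = [{"item": "nut", "weight_g": 200.0}]

# Each rule maps a local attribute to (global attribute, value converter).
# Renaming handles the naming conflict; the converter handles the unit difference.
RULES_A = {"prod_name": ("name", str), "weight_kg": ("weight_kg", float)}
RULES_B = {"item": ("name", str), "weight_g": ("weight_kg", lambda g: g / 1000.0)}

def homogenize(rows, rules):
    """Rename attributes and convert values into the common representation."""
    out = []
    for row in rows:
        out.append({rules[attr][0]: rules[attr][1](value) for attr, value in row.items()})
    return out

unified = homogenize(source_a, RULES_A) + homogenize(source_b, RULES_B)
```

Table-versus-attribute conflicts, where one source stores as a table what another stores
as an attribute, need structural transformations that this sketch does not cover.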
5. Federated Multidatabase Systems
In recent years there has been a rapid trend toward the distribution of computer
systems over multiple sites that are interconnected via a communication network. A
Federated Database System (FDBS) is composed of heterogeneous hardware, operating
systems, database management systems and applications. Generally speaking, a federated
database system provides a logically integrated view of existing heterogeneous, distributed
databases. However, the implementation of a system capable of operation in a federated
database environment is a complex task.
5.1 Requirements for federated database management system
The data management systems are often database management systems or merely
file-based systems that differ in several aspects such as data model, query language, system
architecture, etc., as well as in the structure and semantics of the data managed. More recent
applications often require access to the distributed databases, but their implementation fails
due to heterogeneity. FDBMSs are designed to overcome this problem. An FDBMS
has to meet the following requirements:
- A virtual, homogeneous interface (federated schema) has to be provided in order to
supersede this heterogeneity.
- The autonomy of the local data management system (component database system,
CDBS) has to be maintained in order to guarantee that the local legacy applications
can continue to be used without any changes.
- The federated schema should facilitate the definition of an external schema, which can
either be a view or the same schema with some other underlying data model.
- The support of mechanisms to guarantee consistency (e.g. the transaction
mechanism).
Note also that federated database systems have special requirements with
respect to integrity maintenance. For instance, it must be possible to express inter-database
dependencies in an adequate language, and the federated system must be able to
enforce these constraints.
5.2 Characteristics of a federated database environment
In a centralized database system, all system resources such as data, DBMS
software, and applications reside at a single computer or site. In a FDBS system the
resources are often spread over multiple sites of a computer network, frequently
composed of heterogeneous hardware and software. Many organizations today are
decentralized, with each department having its own responsibilities, and a FDBS fits more
naturally in the structure of such an organization. When several databases exist and there is a
need to perform some global applications, a FDBS is the natural solution.
The characteristics of a federated database environment are as follows :
- Hardware/Operating System Independence - a FDBS supports a heterogeneous
environment.
- Database Management System Independence - retrieval of data from databases on
the network is facilitated without having to take into account the data model
peculiarities.
- Distributed Query Processing - the system is capable of performing query processing
in a distributed database environment.
- Fragmentation Transparency - logical and physical fragmentation of the data on the
network is invisible to the users.
- Location Transparency - although we have multiple databases on the network they
are presented as if the data were stored at one site.
- Local Autonomy - in a geographically distributed database many applications are
local and a FDBS must allow for this locality of applications.
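Fragmentation and location transparency can be illustrated with a small sketch in
which one logical relation is stored as fragments at several sites. The relation
employee(id, name, salary), the three "sites" and the function `employee` are
hypothetical; plain dictionaries stand in for the remote databases.

```python
# Hypothetical fragments of one logical relation employee(id, name, salary).
# The name data is split by rows (horizontal fragments); the salary data is
# kept apart as a separate column group sharing the key `id` (vertical fragment).
NAMES_SITE_1 = {1: "Ann", 2: "Bob"}          # horizontal fragment, ids 1-2
NAMES_SITE_2 = {3: "Cy"}                     # horizontal fragment, id 3
SALARIES_SITE_3 = {1: 50, 2: 40, 3: 60}      # vertical fragment with salaries

def employee():
    """Rebuild the logical relation so callers never see sites or fragments."""
    names = {**NAMES_SITE_1, **NAMES_SITE_2}                 # union of row fragments
    return [(i, names[i], SALARIES_SITE_3[i]) for i in sorted(names)]  # join on id

rows = employee()
```

A user of `employee()` sees a single relation, as if all the data were stored at one
site.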
Federated databases are autonomous, cooperating, information resources. They
cannot and should not be directly integrated, both because of inherent differences in their
deep models and ontologies, and because of the need to keep maintenance local and
distinct. Besides that, a federated database system is a loose integration of stand-alone
database systems, where both global applications accessing multiple database systems, and
local applications are supported. Thus federated transaction management must support
global and local transactions concurrently.
5.3 Architecture of a federated system
The architecture of a federated multidatabase system consists of five levels
(figure 3):
- Local schema. The local schema is the conceptual schema of a component database. A
local schema is expressed in the native data model of the component DBMS, and
hence different local schemas may be expressed in different data models.
- Component schema. A component schema is derived by translating local schemas into
a data model called the canonical data model (CDM) of the FDBS. Two reasons for
defining component schemas in a CDM are (1) they describe the divergent local
schemas using a single representation and (2) semantics that are missing in a local
schema can be added to its component schema. Thus they facilitate negotiation and
integration tasks performed when developing either a tightly coupled FDBS or a
loosely coupled one.
- Export schema. An export schema represents a subset of a component schema that is
available to the FDBS. It may include access control information regarding its use by
specific federation users. The purpose of defining export schemas is to facilitate
control and management of association autonomy.
- Federated Schema. A federated schema is an integration of multiple export schemas. A
federated schema also includes the information on data distribution that is generated
when integrating export schemas. This schema supports the distribution feature of a
FDBS.
- External Schema. An external schema defines a schema of a user and/or application or
a class of users/applications. Reasons for the use of external schemas are as follows:
(1) Customization: An external schema can be used to specify a subset of information
in a federated schema that is relevant to the users of the external schema.
(2) Additional integrity constraints can also be specified in the external schema.
(3) Access control: Export schemas provide access control with respect to the data
managed by the component databases, and external schemas provide access control with
respect to the data managed by the FDBS.
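The five schema levels can be traced in a short Python sketch, given here only as an
illustration. Schemas are represented as plain dictionaries, which play the role of
the canonical data model; the relation and attribute names, the rename table and all
four functions are hypothetical.

```python
# Local schema, expressed in the component's own terms (hypothetical example).
local_schema = {"EMP": ["ENO", "ENAME", "SAL", "SSN"]}

def to_component_schema(local, renames):
    """Translate a local schema into the canonical data model (plain dicts here)."""
    return {
        renames.get(rel, rel): [renames.get(a, a) for a in attrs]
        for rel, attrs in local.items()
    }

def to_export_schema(component, hidden):
    """Keep only the attributes the component is willing to share."""
    return {rel: [a for a in attrs if a not in hidden] for rel, attrs in component.items()}

def to_federated_schema(exports):
    """Integrate export schemas and record at which sites each relation lives."""
    federated = {}
    for site, schema in exports.items():
        for rel, attrs in schema.items():
            entry = federated.setdefault(rel, {"attrs": [], "sites": []})
            entry["attrs"] += [a for a in attrs if a not in entry["attrs"]]
            entry["sites"].append(site)
    return federated

def to_external_schema(federated, wanted):
    """Give one class of users a customized subset of the federated schema."""
    return {rel: federated[rel]["attrs"] for rel in wanted if rel in federated}

renames = {"EMP": "employee", "ENO": "id", "ENAME": "name", "SAL": "salary", "SSN": "ssn"}
component = to_component_schema(local_schema, renames)
export = to_export_schema(component, hidden={"ssn"})
federated = to_federated_schema(
    {"site_a": export, "site_b": {"employee": ["id", "name", "dept"]}}
)
external = to_external_schema(federated, wanted=["employee"])
```

Hiding `ssn` at the export level is how the component keeps control over what it shares,
and the `sites` entry in the federated schema is the data distribution information
mentioned above.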
5.4 Some interesting topics
A main topic in federated system environments is the global identification of
objects. On the federation level, a framework for object identification is needed so as to
meet the demand for global uniqueness and to allow the users to access the objects in the
local systems.
Many enterprises suffer from the problems of heterogeneous database systems,
such as redundancy and lack of control. Existing approaches either try to couple the
database systems into a federation or migrate data to a single (often new) database system.
General FDBS concepts couple the multiple database systems and allow several degrees of
global control over heterogeneity and redundancy. However, existing federated database
systems provide uniform access to multiple heterogeneous DBS, but do not enable objects
to move across the DBS while retaining global identity.
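One simple way to obtain globally unique identifiers, shown here only as a sketch, is
to pair the name of the component database with the local key of the object. The
classes `GlobalId` and `Federation` are hypothetical and are not taken from any
existing system.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GlobalId:
    """Hypothetical global identifier: component name plus local key."""
    component: str
    local_key: str

class Federation:
    """Toy federation layer that maps global identifiers to local objects."""

    def __init__(self):
        self._components = {}            # component name -> {local key: object}

    def register(self, name, objects):
        self._components[name] = objects

    def identify(self, name, local_key):
        """Build a globally unique identifier for an existing local object."""
        if local_key not in self._components[name]:
            raise KeyError(local_key)
        return GlobalId(name, local_key)

    def resolve(self, gid):
        """Use a global identifier to reach the object in its local system."""
        return self._components[gid.component][gid.local_key]

fed = Federation()
fed.register("hr", {"17": {"name": "Ann"}})
fed.register("sales", {"17": {"name": "Order 17"}})
gid_hr = fed.identify("hr", "17")
gid_sales = fed.identify("sales", "17")
```

The same local key 17 yields two distinct global identifiers. Because the identifier
embeds the component name, this scheme does not by itself let an object move to
another system while keeping its identity, which is the limitation noted above.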
In designing a federated database, we should take into account the transformation
of heterogeneous local schemas from the native data models of the component database
systems into a common data model, the integration of these homogenized schemas into the
federated schema, and the derivation of external schemas for global applications.
For resolving conflicts that arise during the transformation or the integration
process, the consideration of integrity constraints can be useful, although the process
becomes more complicated. For certain kinds of conflicts we have to add new global
constraints, but often there is not a unique and predictable choice of additional global
constraints. The way of resolving such conflicts also depends on the intended semantics
of the federated schema and, thereby, on the intended global applications.
6. Data Sharing
In a heterogeneous database environment an ordinary user can face
many problems. These problems arise because the user cannot know all the details
of every database that participates in the heterogeneous environment that he/she wants to
use. As a result, one of the major goals of a federated system is to support uniform
access to the data of the different component databases [Ram 1991]. The problem of
uniform and unified access to the data of the
component databases is called the problem of data sharing. In the related
literature one can find three ways of dealing with the problem of data
sharing in heterogeneous databases.
- Transfer all the data from one database to another. In this case the “original”
database is converted to an equivalent “target” database, which uses
a different model to represent the information of the “original” database. This
can be done using a special tool called a database converter. What
actually happens is that (1) a copy of the original database is created, (2) the
copy is converted to the model of the “target” database, and (3) the final result
(the target database) is given to the user.
- Create an intermediate schema, and transfer the data from every component
database that participates in the federation to the database that is described
by this intermediate schema. In this case a common
schema is created, which is a combination of the schemas of all the component
databases. Using this schema a new common database is also created and all
the data are transferred to it. The result that is given to the user is that common
database.
- Create an intermediate schema, which is a combination of all the schemas of
the component databases, and develop appropriate mechanisms for accessing
the data directly in the component databases, using that schema. This case is
similar to the previous one, with the only difference that now there is
no need to create a common database (or, of course, to transfer
any data to it). The data remain in the initial databases, and
the user can access them using the mechanisms that are provided.
The main problem in all the above cases is that the original and the target schema might
not be equivalent.
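The difference between the second and the third approach can be seen in a small Python
sketch. It assumes a hypothetical intermediate schema person(name, city) and two
component databases represented as lists of rows; the mapping tables and the functions
`materialize` and `virtual_scan` are illustrative only.

```python
# Hypothetical component databases, each a list of rows in its own schema.
component_1 = [{"name": "Ann", "city": "Athens"}]
component_2 = [{"fullname": "Bob", "town": "Patras"}]

# Intermediate schema person(name, city), with one attribute mapping per component.
MAPPINGS = [
    (component_1, {"name": "name", "city": "city"}),
    (component_2, {"fullname": "name", "town": "city"}),
]

def translate(row, mapping):
    """Express one local row in terms of the intermediate schema."""
    return {mapping[attr]: value for attr, value in row.items()}

def materialize():
    """Second approach: copy all data into a new common database."""
    return [translate(r, m) for rows, m in MAPPINGS for r in rows]

def virtual_scan():
    """Third approach: leave the data in place and translate rows on access."""
    for rows, m in MAPPINGS:
        for r in rows:
            yield translate(r, m)

common_db = materialize()
component_2.append({"fullname": "Cy", "town": "Volos"})   # a later local update
stale = [p["name"] for p in common_db]
fresh = [p["name"] for p in virtual_scan()]
```

The copied common database no longer reflects the later local update, while access
through the intermediate schema does. The price of the third approach is that every
access has to go through the translation mechanism.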
7. Object-Oriented Multidatabase Systems
Recently, many researchers have suggested using object-oriented techniques to
facilitate building multidatabase systems. Object-oriented techniques have
been widely applied in database technology. Although using them in building
multidatabases seems promising, the lack of a common methodology impedes further
development.
Until now, object technology has influenced the design and implementation of
multidatabase systems in the following three dimensions:
- System architecture. According to the architectural model called Distributed Object
Architecture, the information stored in the database is modeled as objects, and the
methods of retrieving and updating this information are modeled as the methods of the
objects.
- Schema architecture. Several database researchers have recently
advocated the use of an object-oriented data model as the Common Data Model
(CDM). The objects of the database model are of a finer granularity than the
distributed objects: at one extreme, an entire component database may be modeled as a
single distributed complex object.
- Transaction management. Object technology has also influenced a number of aspects
of heterogeneous transaction management. It offers an efficient method of modeling
and implementation, facilitates the use of semantic information and has independently
introduced the notion of local transaction management.
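The extreme case mentioned above, in which an entire component database is modeled as a
single distributed object, might look as follows. The class `DistributedObject` and its
two methods are hypothetical, and a dictionary stands in for the wrapped database.

```python
class DistributedObject:
    """Hypothetical wrapper presenting a whole component database as one object."""

    def __init__(self, rows):
        self._rows = dict(rows)          # stands in for the underlying database

    def retrieve(self, key):
        """Retrieval is exposed as a method of the object."""
        return self._rows.get(key)

    def update(self, key, value):
        """Updates are likewise methods, so callers never see local storage."""
        self._rows[key] = value

inventory = DistributedObject({"bolt": 10})
inventory.update("nut", 5)
stock = (inventory.retrieve("bolt"), inventory.retrieve("nut"), inventory.retrieve("gear"))
```

Callers interact only with the methods of the object, whatever data model the wrapped
database uses internally.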
8. Existing Heterogeneous Systems
In this section we give a brief description of systems that have been developed or
are being developed for either production use or academic purposes.
8.1 Non-object oriented federated databases
Here we review some of the existing (non-object oriented) federated databases.
(For each of them we list the category it belongs to according to the classification of
heterogeneous databases given in Section 3.)
- The Amoco Distributed Database System (ADDS) [Breitbart and Tieman 1985,
Breitbart et al. 1986] project began in late 1983, and it is one of the first projects in
this area. ADDS is a tightly coupled federated system supporting multiple federated
schemas. It is based on the relational data model and uses an extended relational
algebra query language. Its local database schemas are mapped into multiple federated
database schemas, called Composite Database (CDB) definitions. The mappings are
stored in the ADDS data dictionary, which is fully replicated at all ADDS sites to
expedite query processing. A CDB is usually defined for each application, but
one CDB can also be shared among applications. The CDBs support
the integration of the hierarchical, relational and network data models. Queries
submitted for execution are compiled and optimized for minimal data transmission
cost. ADDS maintains the autonomy of the local database systems and does not
require any modifications to local DBMS software. The only communication between
ADDS and the local DBMS is in the form of query submission and data retrieval.
- DATAPLEX [Chung 1990] is a heterogeneous distributed database system being
developed by General Motors Corporation, and it is a tightly coupled federated system
supporting multiple federated schemas. It allows queries and transactions to retrieve
and update distributed data managed by diverse data systems such that the location of
data is transparent to requesters. In this environment, different data management
systems can run on different operating systems that may be connected by different
communication protocols. The relational model of data is used as the global data
model. Since different data models used by unlike database systems structure data
differently, the data definition for each sharable database in the heterogeneous
distributed database system is transformed to an equivalent relational data definition or
conceptual schema. The conceptual schema is implemented as a set of overlapping
relational schemas, one for each location. The relations at each location represent data
objects that need to be accessed by users at that location. Consequently, conceptual
schemas are neither centralized nor replicated.
- The Integrated Manufacturing Data Administration System (IMDAS) [Barkmeyer et
al. 1986, Su et al. 1986] was developed to support a prototype computer-integrated
manufacturing (CIM) environment. It is a tightly coupled federated system with a single
global schema. The integrating data model is the Semantic Association Model, a
semantic data model capable of representing the complex structures and relationships
and many integrity constraints found in a manufacturing enterprise. A fragmentation
schema maps the global model to the underlying databases, supporting both horizontal
and vertical partitioning of a given object class. Existing database systems are front
ended by IMDAS modules supporting an internal query interchange form, which is an
extended algebra on generalized relations corresponding to the modeled object classes.
In general IMDAS supports both distributed updates and distributed retrievals, but its
fragmentation schema does not support replication, which is a significant limitation of
the system.
- Ingres Corporation grew out of the INGRES [Stonebraker 1986] project at the
University of California at Berkeley. Ingres/STAR, which provides transparent access
to distributed data, was first introduced in 1986. Ingres/STAR is a tightly coupled
federated system supporting multiple federated schemas. The Ingres DBMS provides access
to an Ingres database, which is a named collection of tables. Ingres front-end programs
submit SQL queries to the Ingres DBMS to obtain data stored in the database. An
Ingres Gateway provides a method whereby data stored in other data managers is
made to appear as if it were in an Ingres database and thus is made available to Ingres
front-end programs. The Ingres/STAR system allows users to access a distributed
database, which is defined as a collection of tables from one or more Ingres databases.
Any set of tables from any set of Ingres databases can be combined to form a new,
distributed Ingres/STAR database. This includes not only databases under an Ingres
DBMS but also databases accessible via an Ingres/Gateway and other Ingres/STAR
databases. A single Ingres/STAR server may service multiple distributed databases,
and multiple Ingres/STAR servers may exist in the network. Access to the
Ingres/STAR distributed databases is transparent in the sense that once the database
has been created, the users of the database no longer need to know anything about the
existence of the individual Ingres databases that make up the distributed database.
- The development of the Mermaid [Templeton et al 1987] system was done at System
Development Corporation (later a part of Unisys). It is a tightly coupled federated
system supporting multiple federated schemas. In a sense, Mermaid is not a database
management system but rather a front-end system that locates and integrates data that
are maintained by local DBMSs. Parts of the local databases may be shared among
global users. Several levels of heterogeneity are supported by Mermaid: Hardware,
Operating System of the DBMS host, Network connection to the DBMS host, DBMS
type and query language, Data Model, Database schema. The system permits retrieval
across databases and updates to a single database. An interesting feature is that a
read transaction may see an inconsistent state of the database, since local updates
may occur in the local databases during query execution. Mermaid minimizes the
window of inconsistency by making snapshots of all relations as a first step in
processing.
- MULTIBASE [Landers and Rosenberg 1982] was developed by the Advanced Information
Technology Division of Computer Corporation of America, now Xerox Advanced
Information Technology. It is a tightly coupled federated system
that provides the definition of multiple local schemas and multiple federated schemas
or views. Local schemas describe the data available at an individual local DBMS.
Views describe integrations of the data described in local schemas. The MULTIBASE
view mechanism is also used to resolve data incompatibilities that frequently arise
when separately developed and maintained databases are accessed conjointly.
Incompatibilities include (a) differences in naming conventions, underlying data
structures, representations, or scale, (b) missing data, and (c) conflicting data values.
When defining a view, the database administrator applies knowledge of the local
databases to determine what incompatibilities might arise and what rules should be
used to reconcile them. The rules are included in the view definition, after which they
are followed automatically by the system in generating answers to queries.
- Sybase, Inc. was founded in 1984 with the goal of bringing a high-performance
distributed RDBMS to the market. SYBASE [Thomas et al. 1990] is the initial product
of Sybase, Inc. In 1990 Sybase introduced the Open Server, a product that
extends the SYBASE distributed capabilities to heterogeneous data sources. SYBASE
is a loosely coupled federated system. SYBASE attempts to open the architecture as
widely as possible to allow any database, application, or service to be integrated into
its client/server architecture in a heterogeneous environment. No global data model or
schema is enforced. Rather, distributed operations can be supported via application
programming. This provides a high degree of site autonomy. In traditional centralized
database systems, users of on-line applications are not given direct update access to a
database but rather communicate with an application program that protects the
database from the user. This common approach can be called “application-enforced
integrity”. The legality of any update is determined principally by rules enforced by the
application program. Application-enforced integrity is, however, a flawed approach in
heterogeneous distributed databases, where the application may be written in a
different department or in a different city from that of the DBA whose database is being
updated. A better alternative in a heterogeneous distributed database is to enforce data
integrity within the database itself. Under this alternative, an application at a remote
site communicates directly with a database that has sufficient richness of semantics to
decide by itself whether the transaction violates any integrity rules. Stored procedures
provide this capability. SYBASE supports this second kind of integrity. In
particular, the SYBASE Open Server provides a consistent method of receiving queries
from an application in SYBASE and passing them to a non-SYBASE database or
application.
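The contrast drawn in the SYBASE entry between application-enforced integrity and integrity enforced within the database can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's standard sqlite3 module rather than SYBASE, the account table and the withdrawal limit are hypothetical, and because SQLite has no stored procedures a CHECK constraint and a trigger stand in for them.

```python
import sqlite3


def open_account_db():
    """Create an in-memory database whose integrity rules live in the schema."""
    db = sqlite3.connect(":memory:")
    db.executescript(
        """
        CREATE TABLE account (
            id      INTEGER PRIMARY KEY,
            balance INTEGER NOT NULL CHECK (balance >= 0)
        );
        -- Trigger standing in for a stored procedure (hypothetical rule):
        -- reject any single update that withdraws more than 500.
        CREATE TRIGGER limit_withdrawal
        BEFORE UPDATE OF balance ON account
        WHEN OLD.balance - NEW.balance > 500
        BEGIN
            SELECT RAISE(ABORT, 'withdrawal exceeds limit');
        END;
        INSERT INTO account (id, balance) VALUES (1, 1000);
        """
    )
    return db


def remote_update(db, account_id, new_balance):
    """A 'remote application' that performs no integrity checks of its own.

    Returns True if the database accepted the update, False if it rejected it.
    """
    try:
        with db:  # commit on success, roll back on error
            db.execute(
                "UPDATE account SET balance = ? WHERE id = ?",
                (new_balance, account_id),
            )
        return True
    except sqlite3.DatabaseError:
        return False


def balance_of(db, account_id):
    """Read back the stored balance."""
    row = db.execute(
        "SELECT balance FROM account WHERE id = ?", (account_id,)
    ).fetchone()
    return row[0]
```

Because the rules are stored with the data, an application written at another site cannot bypass them, whatever checks it does or does not perform itself.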
8.2 Object oriented databases
In this section we review some of the existing object-oriented multidatabase
projects.
- Pegasus [Ahmed et al. 1991, Connors and Lyngbaek 1988, Ahmed et al. 1993] is a
multidatabase system being developed by the Database Technology Department at
Hewlett-Packard Laboratories. Pegasus provides access to native and external
databases. A native database is created in Pegasus and both its schema and data are
managed by Pegasus. External databases are accessible through Pegasus, but are not
directly controlled by it. The important features of Pegasus are:
- the way that it treats the semantic, schema and identity conflicts (see section 4
for the definitions of these conflicts),
- the way that the foreign functions are implemented,
- the support of cost-based or heuristic-based query optimization, depending on
the availability of statistical data.
- ViewSystem is an object-oriented environment that has been developed as a first
prototype of KODIM (Knowledge Oriented Distributed Information Management)
[Kaul et al. 1991], which is mainly concerned with the dynamic integration of
heterogeneous and autonomously administered information bases. ViewSystem
provides an object-oriented query language with extensive view facilities for defining
virtual classes from base classes. ViewSystem is implemented in an object-oriented
environment, namely the Smalltalk environment, and in this way benefits from a large
set of tools and reusable software. Its important characteristics are:
- It is embedded in an object-oriented programming environment and benefits
from reusable software
- Provides a concrete methodology for creating virtual classes based on a set of
class constructors
- Offers a hybrid approach to query processing
- Allows for organizing views in different modules
- The Operational Integration System (OIS) [Gagliardi et al. 1990] is a generalized
integration tool that provides the application environments with a uniform interface for
accessing data managed by heterogeneous systems. These systems are expected to be
file systems, DBMSs, information retrieval systems, remote databank services and ad
hoc applications. OIS has been partially developed in the framework of the Esprit
Project 21009 (TOOTSI). Its important feature is:
- The introduction of the concept of operational mapping along with an
implementation.
- The Comandos Integration System (CIS) [Bertino et al. 1989, Bertino et al. 1988] has
been implemented as part of the ESPRIT project COMANDOS. It has been used for
integrating several different application environments, including relational DBMSs,
graphical databases and public databanks. The interesting feature of this system is
similar to the one mentioned for OIS.
- The Object Management System (OMS) [Pathak et al. 1990; Heiler and Zdonik 1990]
is an object-based interoperability framework for engineering information systems
(EIS) designed at Xerox Advanced Information Technology (XAIT). Its important
features are:
- Definition of view facilities, and data sharing which is implemented by
delegation.
- Extended transaction model that supports cooperation between transactions and
user-specified correctness [Heiler et al. 1992].
- The Distributed Object Management System (DOMS) [Buchmann et al. 1992, Manola et
al. 1992], which is being developed at GTE Laboratories, is an object-oriented
environment in which autonomous and heterogeneous local systems can be integrated
and native objects can be implemented. The local systems are not limited to database
systems but may be conventional systems, hypermedia systems, application programs
etc. A prototype DOMS was implemented connecting Apple Macintosh Hypercard
applications, the Sybase relational DBMS and the ONTOS object DBMS. The
prototype supports a simplified version of the data model and language and does not
currently support concurrency control and recovery facilities but supports a limited
form of “distributed commit”. Its important features are:
- Complete framework in the context of distributed object architecture.
- Support of transaction management that includes operations at transaction-less
systems; however, transaction correctness and recovery are not formalized,
especially in the presence of local transactions.
- An attempt to apply state-of-the-art knowledge at all parts of the system and
include most of the features that appear in the literature.
- UniSQL/M [Tamer et al. 1991] is a heterogeneous database system, being
developed at UniSQL, that allows the integration of SQL-based relational database
systems and the UniSQL/X unified relational and object-oriented database system. Its
important features are:
- Systematic treatment of identity, schema, and semantic conflicts.
- Commercial database system already released.
- The Carnot project at MCC [Woelk et al. 1993, Woelk et al. 1992,
Tomlinson et al. 1992, Collet et al. 1991] addresses the problem of logically unifying
physically distributed, enterprise-wide heterogeneous information coming from a
variety of systems, including database systems, database applications, expert systems,
knowledge bases, business workflows, and the business organization itself. Its
important features are:
- Use of a common-sense knowledge base as the global schema instead of a data
model.
- Implementation of the Extensible Service Switch (ESS) which provides
interpretive access to communication resources, local information resources and
applications at a local site.
- Thor [Liskov et al. 1992] is an object-oriented distributed DBMS being implemented
at MIT. Thor is intended to be used in heterogeneous distributed systems to allow
programs written in different programming languages to share a universe of persistent
objects in a convenient manner. Thor is not a multidatabase system, since it does not
support the integration of preexisting systems; rather, it is a distributed database system
for sharing information by means of objects in the Thor universe. We
mention it here because it provides a different approach to the problem of handling
heterogeneous information. A prototype of Thor, called TH, has been implemented in
Argus. Its important feature (which is mainly due to the fact that it
is not an ordinary multidatabase system (MDBS)) is the following:
- Addressing performance issues, such as object caching and combined
operations, as well as physical storage issues that are not considered by MDBSs
because such issues are handled by the local systems.
- FBASE [Mullen 1992] and InterBase [Bukhres 93] are two
prototype systems developed at Purdue University as part of the InterBase project.
FBASE concentrates on data modeling issues, while InterBase
provides complete transaction support. Currently, work is underway to integrate the
data model of FBASE into the InterBase system. The important feature of the former is:
- The class hierarchy provides a uniform way of mapping different data models to
the object oriented model.
The important features of InterBase are:
- Use of an extended transaction model.
- Use of a Transaction Specification Language to support the extended
transaction model.
- Treatment of commitment.
- The Federated Information Bases (FIB) project at Georgia Tech [Navathe et al. 1994]
focuses mainly on the semantic interoperability problems encountered in multidatabase
systems. Its important features are:
- Use of classification to perform query processing and schema integration.
- Automation of the schema integration process.
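The idea of defining virtual classes from base classes, mentioned in the ViewSystem entry above, can be illustrated with a small sketch. It is not ViewSystem's actual query language or constructor set: the two constructors shown (specialize and generalize), the class names and the sample objects are assumptions chosen only to show how a virtual class can be derived on demand rather than stored.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class BaseClass:
    """A stored class: its extent is the list of objects it holds."""
    name: str
    objects: List[dict]

    def extent(self) -> List[dict]:
        return list(self.objects)


@dataclass
class VirtualClass:
    """A derived class: its extent is recomputed from other classes on demand."""
    name: str
    derive: Callable[[], Iterable[dict]]

    def extent(self) -> List[dict]:
        return list(self.derive())


def specialize(name, source, predicate):
    """Hypothetical constructor: keep the objects of `source` that satisfy `predicate`."""
    return VirtualClass(name, lambda: (o for o in source.extent() if predicate(o)))


def generalize(name, sources, attributes):
    """Hypothetical constructor: union of several classes, projected on shared attributes."""
    def derive():
        for source in sources:
            for obj in source.extent():
                yield {attr: obj.get(attr) for attr in attributes}
    return VirtualClass(name, derive)


# Two autonomously maintained base classes with different attributes.
employees = BaseClass("Employee", [{"name": "Ann", "age": 52, "salary": 10}])
contractors = BaseClass("Contractor", [{"name": "Bob", "age": 30, "rate": 5}])

# Virtual classes built on top of them; nothing is copied or stored.
person = generalize("Person", [employees, contractors], ["name", "age"])
senior = specialize("Senior", person, lambda o: o["age"] >= 50)
```

Since the extents are computed when requested, an object added later to a base class appears in every virtual class derived from it.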
9. Current Issues
Heterogeneous systems are inevitable today. The high investments in software
and the value of existing databases make homogeneous database systems, particularly in CAD,
very unlikely. Data are typically maintained on heterogeneous software and hardware
platforms for historical and technical reasons. The costs incurred and the loss of
functionality when transferring the data to a monolithic system rule out a homogeneous
solution. It follows that combining distributed heterogeneous systems through a
mediator system is the only practical solution.
The mediator system to be developed has to be object-oriented, distributed and
active. It must furthermore provide complete OODBMS properties. Object-oriented
models have already proved their flexibility and are well suited to such a mediator
system because they are easily extensible and able to represent objects of arbitrary
complexity. The system is furthermore distributed, including its control structures. Work
at the Technical University Darmstadt, Germany, is implementing the integration tool
Persistence, an extensible C++ interface to relational databases. In the multidatabase
testbed, Informix and Sybase are used as relational components, with an extended Sybase
interface to signal local events to the mediator.
Many other research groups are working in this area. One of the themes of the
Rodin Database Group in France is the discovery and retrieval of information in
heterogeneous, autonomous data sources over global networks such as the Internet. ConceptBase
in Germany is a deductive object manager for meta-databases.
Stanford University in California has the projects “C3”, which investigates changes,
consistency and configurations in heterogeneous information systems, and “TSIMMIS”,
which investigates information source wrapping and mediation across heterogeneous
information sources.
In addition to all the above, there are other projects currently active, in
particular in federated databases. The Land-Relational Information Systems (LRIS)
Network being developed in Alberta, Canada, demonstrates a practical implementation of
the query processing model in a federated database. C-LAB (formerly Cadlab) in Germany
has a Database Federation project, in which Efendi, a federated database system, is developed
as a software layer on top of multiple heterogeneous, autonomous database
and file systems. It was demonstrated at SIGMOD'95 (San Jose, May 1995) and
ObjectWorld'95 (Frankfurt, October 1995). The demonstration shows how
heterogeneous, autonomous database and file systems are coupled into a federated database
system. A software layer on top of them with an ODMG-extended interface allows the
data of all the data handling systems to be accessed as if they formed a single logical
database. It allows data to be selected, updated, and created globally. As the first
federated database system to do so, OpenDM/Efendi allows data to be migrated among the
data handling systems while preserving their global object identity.
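Migrating data among data handling systems while preserving global object identity can be pictured with a small sketch. It is not the OpenDM/Efendi design: the Federation class, the identifier format and the in-memory dictionaries standing in for component systems are all assumptions made for illustration. The point is only that the federation layer hands out identifiers that are independent of where an object is stored, so moving the object changes an internal location table and nothing that clients hold.

```python
import itertools


class Federation:
    """Hypothetical federation layer over several component systems.

    Each component system is modeled as a plain dict from local key to record.
    """

    def __init__(self, systems):
        self.systems = systems            # system name -> {local key: record}
        self.locations = {}               # global oid -> (system name, local key)
        self._counter = itertools.count(1)

    def create(self, system, local_key, record):
        """Store a record in one component system and return its global identifier."""
        self.systems[system][local_key] = record
        oid = f"oid-{next(self._counter)}"
        self.locations[oid] = (system, local_key)
        return oid

    def get(self, oid):
        """Resolve a global identifier to the record, wherever it currently lives."""
        system, local_key = self.locations[oid]
        return self.systems[system][local_key]

    def migrate(self, oid, target_system, new_local_key):
        """Move the record to another component system; the global identifier is unchanged."""
        system, local_key = self.locations[oid]
        record = self.systems[system].pop(local_key)
        self.systems[target_system][new_local_key] = record
        self.locations[oid] = (target_system, new_local_key)


# Usage: a record created in a file system is later moved into a relational store.
fed = Federation({"files": {}, "relational": {}})
part = fed.create("files", "/parts/p17.dat", {"name": "bracket"})
fed.migrate(part, "relational", 17)
```

After the migration, `fed.get(part)` still returns the same record even though it now resides in a different component system under a different local key.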
When a component database participates in a multidatabase system, its data model is mapped to a data
model that is the same for all participating systems, which is called the Common (Canonical) Data
Model (CDM).
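As a rough illustration of such a mapping, the sketch below translates records from two component sources into one common representation. Everything in it is hypothetical: the Part type, the column names PNO and WT_G, the "id;pounds" file format and the choice of kilograms as the common unit are assumptions, not part of any system described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Part:
    """Common representation shared by all participating systems (hypothetical)."""
    part_id: str
    weight_kg: float


def from_relational_row(row):
    """Map a row of a relational component (weight stored in grams) to the common model."""
    return Part(part_id=str(row["PNO"]), weight_kg=row["WT_G"] / 1000.0)


def from_file_record(line):
    """Map a flat-file record of the form 'id;pounds' to the common model."""
    ident, pounds = line.split(";")
    return Part(part_id=ident.strip(), weight_kg=float(pounds) * 0.45359237)


def global_parts(rows, lines):
    """Present both component sources as one collection in the common model."""
    return [from_relational_row(r) for r in rows] + [from_file_record(l) for l in lines]
```

Each component keeps its own naming and units; only the mapping functions know about them, so a global query sees a single uniform collection.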
CIM stands for Computer Integrated Manufacturing, a manufacturing system that provides
direct data sharing among production control systems and the engineering and administrative systems
that support them.
On-line applications involve remote updates and have a strict requirement to maintain local site
autonomy.