INTRODUCTION TO HETEROGENEOUS DATABASES

Contents

  1. Introduction
  2. Basic Definitions
  3. Classification Of Heterogeneous Databases
  4. Problems In Integration Of Heterogeneous Databases
  5. Federated Multidatabase Systems
    5.1 Requirements for federated database management system
    5.2 Characteristics of a federated database environment
    5.3 Architecture of a federated system
    5.4 Some Interesting Topics
  6. Data Sharing
  7. Object-Oriented Multidatabase Systems
  8. Existing Heterogeneous Systems
    8.1 Non-object oriented federated databases
    8.2 Object oriented databases
  9. Current Issues



1. Introduction

The integration of distributed database systems poses many challenges due to differences in the data management systems (e.g., different vendors), in the data models (e.g., relational, text indexing), in the query and transaction processing algorithms, in the data types (e.g., text, graphics, multimedia, hypermedia, sensor data, knowledge bases), in the format (e.g., structured, unstructured), and in the semantics.

Within the intelligence community, large volumes of heterogeneous legacy intelligence databases exist. For example, an intelligence analyst may have to look for daily incoming messages in Verity TOPIC, specific domain data in DB2, archival messages in an older proprietary database, and a group discussion of a related topic in Lotus Notes. Ultimately, the goal is for the analyst to use each of these data sources to obtain useful information in a timely and simple manner.

2. Basic Definitions

A distinguishing property of a distributed database is that it can be homogeneous or heterogeneous. A homogeneous distributed database is one where all the local databases are managed by the same DBMS. This approach is the simplest one and provides incremental growth, which makes the addition of a new site in the network easy, and increased performance, by exploiting the parallel processing capability of multiple sites.

A heterogeneous distributed database is one where the local databases need not be managed by the same DBMS. For example, one DBMS can be a relational system while another can be a hierarchical system. This approach is far more complex than the homogeneous one, but it enables the integration of existing independent databases without requiring the creation of a completely new distributed database. In addition to its main functions, the distributed DBMS must provide interfaces between the different DBMSs.

A federated database is a combination of autonomous, heterogeneous databases which are operating together.

The software that co-ordinates all these different databases is called a federated database management system (FDBMS); its form is shown in figure 1. The basic characteristic of a federated system is the co-ordination among independent database systems.

Every database that takes part in the federation is called a component database. A component database can take part in more than one federation, and its database manager can be centralized or distributed. One important characteristic of a federated management system is that every component database can continue executing its local tasks while it is participating in the federation.

Federated database management systems form a subset of a more general category of database systems, the multidatabase systems.

Finally, a system that allows periodic data transfer among the different databases is called a data exchange system.

3. Classification of heterogeneous databases

Approaches to managing heterogeneous databases include linking heterogeneous databases via the World Wide Web (WWW), organizing them into database federations or multidatabase systems, and constructing data warehouses. Common to these approaches is allowing component databases to preserve their autonomy, that is, their local definitions, applications, and policy of exchanging data with other databases [Bright et al. 1992].

Heterogeneous database systems have been traditionally classified by the type of schemas, by their extent of data sharing, and by the data access facilities they support.

Schemas supported by a heterogeneous database system include :

  1. local views representing the schemas of component databases, expressed in the Data Definition Language (DDL) of the local databases; and

  2. a global schema expressed in a common DDL, providing a unified view of the schemas of all component databases.

Data sharing in a heterogeneous database system can be at the level of :

  1. linking specific data items in the component databases; or

  2. generic (schema driven) correlations across component databases.

Individual data item links (e.g. hypertext links) between databases neither require nor comply with schema correlations across databases. For schema correlations, data links need to be consistent with the constraints entailed by these correlations, such as inter-database referential integrity constraints.

Data access facilities in a heterogeneous database system can range from :

  1. browsing across component databases; to

  2. querying a centralized data warehouse; to

  3. querying multiple databases.

Browsing across component databases is usually based on traversing WWW hyperlinks from data items in one database to data items in another database, and does not require schema correlations. Querying a data warehouse amounts to querying a single database, where the data of all component databases are represented according to the global schema of the warehouse. Querying multiple databases is carried out by expressing queries over the global schema of the heterogeneous database system or over the component database local views, and translating them into queries for the component databases. Alternatively, a heterogeneous database system can be provided with a multidatabase query language that allows expressing queries that refer directly to elements of component databases.
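The idea of querying multiple databases through a global schema can be illustrated with a minimal sketch. It assumes two hypothetical component databases (held in memory with Python's standard sqlite3 module) whose local schemas use different table and attribute names; the global relation employee(name, department), the mapping table, and the function name are illustrative assumptions, not part of any system described here.

```python
import sqlite3

# Two hypothetical component databases with different local schemas.
personnel = sqlite3.connect(":memory:")
personnel.execute("CREATE TABLE emp (emp_name TEXT, dept TEXT)")
personnel.executemany("INSERT INTO emp VALUES (?, ?)",
                      [("Ada", "Research"), ("Bob", "Sales")])

payroll = sqlite3.connect(":memory:")
payroll.execute("CREATE TABLE staff (full_name TEXT, unit TEXT)")
payroll.execute("INSERT INTO staff VALUES (?, ?)", ("Cy", "Research"))

# Assumed global relation: employee(name, department).
# Each entry maps the global query onto one component's local schema.
MAPPINGS = [
    (personnel, "SELECT emp_name, dept FROM emp WHERE dept = ?"),
    (payroll, "SELECT full_name, unit FROM staff WHERE unit = ?"),
]

def query_employees(department):
    """Answer a global query by running one translated query per component."""
    rows = []
    for conn, local_sql in MAPPINGS:
        rows.extend(conn.execute(local_sql, (department,)).fetchall())
    return sorted(rows)

print(query_employees("Research"))
```

The user only names the global relation; the translation into each local query is hidden in the mapping table, which is roughly the role a global schema plays.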

In addition to all the above, multidatabase systems can be classified according to the following criteria : distribution, heterogeneity, and autonomy [Sheth and Larson, 1990].

Distribution

The data can be distributed across many databases. These databases can reside on the same computer system or on different ones. The advantage of distributing the data is that they can be accessed more easily. In the case of federated databases, the reason for this distribution is that the databases existed before their federation was formed.

Heterogeneity

Many types of heterogeneity are due to technological differences, such as differences in the hardware, in the software, or in the operating system. Many software engineers have dealt with this problem and, as a result, there are many commercial database management systems nowadays that work well under such heterogeneous conditions. Generally, we can divide the heterogeneities into two sub-categories : those that are due to differences among the various database management systems involved in a heterogeneous database, and those that are due to differences in the semantics of the data.

The first sub-category covers heterogeneous databases whose components differ in the structures used, in the restrictions on the domains, or in the query languages that each one uses.

The second sub-category covers heterogeneous databases whose components differ in the meaning of the data they represent, or in the interpretation or use of the same or related data. Even detecting these kinds of heterogeneities is a difficult problem [Seltor et al., 1993].

Finally, we should point out one important problem in the co-operation of heterogeneous databases : finding and resolving the heterogeneities among them.

Autonomy

A component database that is part of a federated database system can have several kinds of autonomy.

1. Design autonomy
This is the ability of a database to choose its own design (in other words, its own data, its own way of representing these data, and the domains of its attributes).

2. Communication autonomy
This is the ability of a component database to decide on its own when it “wants” to communicate with other component databases, or with the federated database system.

3. Execution autonomy
This is the ability of a component database to execute some operations locally, without being influenced by external operations that are executed in other component databases or in the federated database system. This implies that a component database can reject any operation that violates its local constraints.

4. Association autonomy
This is the ability of a component database to decide by itself how much and for how long it will share its resources with other component databases or with the federated database system. This autonomy also includes the ability of the component database to associate itself with the federation or not, and to participate in one or more federations.
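Execution and association autonomy can be sketched in a few lines. The classes below are hypothetical and deliberately simplified: the component enforces a local constraint that the federation does not know about (execution autonomy), and it can join or leave the federation at will (association autonomy).

```python
class ComponentDatabase:
    """Hypothetical component database that keeps its execution autonomy."""

    def __init__(self, name, max_balance):
        self.name = name
        self.max_balance = max_balance  # local constraint, unknown to the federation
        self.accounts = {}

    def local_update(self, account, amount):
        """Apply an update unless it violates the local constraint."""
        new_balance = self.accounts.get(account, 0) + amount
        if new_balance < 0 or new_balance > self.max_balance:
            return False  # rejected locally, whoever submitted it
        self.accounts[account] = new_balance
        return True


class Federation:
    """Hypothetical federation layer that only forwards requests."""

    def __init__(self):
        self.members = {}

    def join(self, db):
        self.members[db.name] = db

    def leave(self, name):
        # Association autonomy: a component may withdraw at any time.
        self.members.pop(name, None)

    def submit(self, name, account, amount):
        db = self.members.get(name)
        if db is None:
            return False  # component is not (or no longer) a member
        return db.local_update(account, amount)
```

A federated update therefore succeeds only if the component is currently a member and its own constraints accept the operation.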

Multidatabase systems can be classified into two categories according to the association autonomy of their component databases : non-federated database systems and federated database systems (see also Sections 2 and 5).

A non-federated database system is a collection of database systems that are not autonomous. Such a system has just one level of management, and all operations are applied to all the component databases. In such a system there is no distinction between local and non-local users, or even operations. If all the component databases are fully associated, the non-federated system is called a unified multidatabase system.

A federated database system consists of component databases that, although autonomous, participate in the federation and allow partial and controlled sharing of their data. Federated database systems lie between no co-ordination and full co-ordination among heterogeneous databases. They can be further classified into loosely coupled and tightly coupled federated database systems, according to who manages the federation and how the component databases are co-ordinated.

In a loosely coupled federated database system, it is the responsibility of the user to create and manage the federation, and there is no control from the federated database system or its managers. These kinds of federated databases are also called interoperable database systems. A loosely coupled federated database system supports multiple federated schemas, because designing such a system actually means designing one or more federated schemas according to the operations applied by the users or the application programs.

In a tightly coupled federated database system, the federated system and its managers are responsible for creating and managing the federation. They are also responsible for controlling the component databases. These kinds of federated database systems can have one or more federated schemas. A federated database with just one schema is called a single federation; one with multiple schemas is called a multiple federation.

Figure 2 shows the classification of multidatabase systems described above, together with a representative example for every category.

A detailed classification of heterogeneous databases according to the three criteria mentioned above can be found in [Ozsu and Valduriez, 1991].

Finally in [Bright et al. 1992] there is another classification of heterogeneous systems. This one classifies the systems from the most tightly coupled ones to the most loosely coupled ones. According to this classification we have : distributed systems, global schema multidatabase systems, federated systems, multidatabase language systems, homogeneous multidatabase language systems, and interoperable systems.

4. Problems in integration of heterogeneous databases

The heterogeneities in the schemas and in the semantics of the component databases are a serious problem in the design and use of multidatabase systems. Here we concentrate on some of the taxonomies, found in the references, that are based on the nature of the heterogeneities. These heterogeneities cause many problems in the integration of heterogeneous databases.

In [Parent et al. 1992] and in [Reddy et al. 1994] there is a taxonomy of different kinds of heterogeneities found in the schemas of the heterogeneous databases.

In [Kim and Seo, 1991] there is a complete taxonomy of the types of heterogeneities found in multidatabase systems that are based entirely on the relational model. These heterogeneities are divided into schema heterogeneities and data heterogeneities. Schema heterogeneities arise because (1) different structures (tables, attributes) are used to represent the same information, and (2) for the same structures, different database systems may use different “design rules”. The schema heterogeneities are divided into table-versus-table conflicts, attribute-versus-attribute conflicts, and table-versus-attribute conflicts. The data heterogeneities describe the case where the various component databases of a heterogeneous database hold wrong data or different representations of the same data.
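As a small illustration of an attribute-versus-attribute conflict combined with a difference in data representation, the sketch below assumes two hypothetical component tables that describe parts: one names the attribute weight_kg and stores kilograms, the other names it wt and stores pounds. All names and rows are invented for the example; it is not the resolution procedure of the cited taxonomy.

```python
# Component A stores the weight in kilograms under "weight_kg".
# Component B stores the same information in pounds under "wt".
rows_a = [{"part_no": "P1", "weight_kg": 2.0}]
rows_b = [{"part": "P2", "wt": 4.0}]

LB_TO_KG = 0.45359237  # exact definition of the international pound

def from_a(row):
    """Rename A's attributes to the common names; units already match."""
    return {"part_id": row["part_no"], "weight_kg": row["weight_kg"]}

def from_b(row):
    """Rename B's attributes and convert pounds to kilograms."""
    return {"part_id": row["part"],
            "weight_kg": round(row["wt"] * LB_TO_KG, 3)}

# Unified view in which both conflicts have been resolved.
unified = [from_a(r) for r in rows_a] + [from_b(r) for r in rows_b]
print(unified)
```

Resolving the naming conflict is a pure renaming, whereas the representation conflict needs a conversion function, which is why the two kinds of heterogeneity are usually treated separately.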

5. Federated Multidatabase Systems

In recent years there has been a rapid trend toward the distribution of computer systems over multiple sites that are interconnected via a communication network. A Federated Database System (FDBS) is composed of heterogeneous hardware, operating systems, database management systems and applications. Generally speaking, a federated database system provides a logically integrated view of existing heterogeneous, distributed databases. However, the implementation of a system capable of operation in a federated database environment is a complex task.

5.1 Requirements for federated database management system

The data management systems are often database management systems or merely file-based systems, differing in several aspects such as data model, query language, and system architecture, as well as in the structure and semantics of the data managed. More recent applications often require access to the distributed databases, but their implementation fails due to heterogeneity. FDBMSs are designed to overcome this problem. An FDBMS has to meet the following requirements :

  1. A virtual, homogeneous interface (federated schema) has to be provided in order to hide this heterogeneity.

  2. The autonomy of the local data management system (component database system, CDBS) has to be maintained in order to guarantee that the local legacy applications can continue to be used without any changes.

  3. The federated schema should facilitate the definition of an external schema, which can either be a view or the same schema with some other underlying data model.

  4. The support of mechanisms to guarantee consistency (e.g. the transaction mechanism).

Note also that federated database systems have special requirements with respect to integrity maintenance. For instance, it must be possible to express inter-database dependencies in an adequate language, and the federated system must be able to enforce these constraints.
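One simple kind of inter-database dependency is a referential constraint that spans two component databases. The sketch below assumes two hypothetical in-memory SQLite databases, one holding customers and one holding orders. Since neither local DBMS can check the constraint on its own, the check is done at the federation level. The table names and the checking function are assumptions made for illustration.

```python
import sqlite3

# Component 1: customers.
customers_db = sqlite3.connect(":memory:")
customers_db.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY)")
customers_db.executemany("INSERT INTO customer VALUES (?)", [(1,), (2,)])

# Component 2: orders that refer to customers held in component 1.
orders_db = sqlite3.connect(":memory:")
orders_db.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER)")
orders_db.executemany("INSERT INTO orders VALUES (?, ?)", [(10, 1), (11, 3)])

def dangling_orders():
    """Return the ids of orders whose customer does not exist in component 1."""
    known = {row[0] for row in customers_db.execute("SELECT id FROM customer")}
    cursor = orders_db.execute(
        "SELECT order_id, customer_id FROM orders ORDER BY order_id")
    return [order_id for order_id, customer_id in cursor
            if customer_id not in known]

print(dangling_orders())
```

This sketch only detects violations after the fact; actually enforcing the constraint would additionally require the federated system to intercept or co-ordinate the updates on both components.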

5.2 Characteristics of a federated database environment

In a centralized database system, all system resources such as data, DBMS software, and applications reside at a single computer or site. In an FDBS the resources are often spread over multiple sites of a computer network, frequently composed of heterogeneous hardware and software. Many organizations today are decentralized, with each department having its own separate responsibilities, and an FDBS fits more naturally into the structure of such an organization. When several databases exist and there is a need to run some global applications, an FDBS is the natural solution.

The characteristics of a federated database environment are as follows :

Federated databases are autonomous, co-operating information resources. They cannot and should not be directly integrated, both because of inherent differences in their deep models and ontologies, and because of the need to keep maintenance local and distinct. Besides that, a federated database system is a loose integration of stand-alone database systems, in which both global applications accessing multiple database systems and local applications are supported. Thus federated transaction management must support global and local transactions concurrently.

5.3 Architecture of a federated system

The architecture of a federated multidatabase system consists of five levels (figure 3):

5.4 Some interesting topics

A main topic in federated system environments is the global identification of objects. On the federation level, a framework for object identification is needed so as to meet the demand for global uniqueness and to allow the users to access the objects in the local systems. Many enterprises suffer from the problems of heterogeneous database systems, such as redundancy and lack of control. Existing approaches either try to couple the database systems into a federation or migrate data to a single (often new) database system. General FDBS concepts couple the multiple database systems and allow several degrees of global control over heterogeneity and redundancy. However, although existing federated database systems provide uniform access to multiple heterogeneous DBSs, they do not enable objects to move across the DBSs while retaining global identity.
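A minimal sketch of global object identification follows. It assumes a hypothetical registry at the federation level that maps each global identifier to the component system and local identifier where the object currently lives. Moving an object changes only this mapping, so the global identity is retained. The class and method names are illustrative and do not describe any of the systems discussed here.

```python
import itertools

class GlobalObjectRegistry:
    """Hypothetical federation-level directory of global object identifiers."""

    def __init__(self):
        self._next_id = itertools.count(1)
        self._location = {}  # global id -> (component name, local id)

    def register(self, component, local_id):
        """Assign a new, federation-wide unique identifier to a local object."""
        gid = next(self._next_id)
        self._location[gid] = (component, local_id)
        return gid

    def resolve(self, gid):
        """Find the component system and local identifier of an object."""
        return self._location[gid]

    def migrate(self, gid, new_component, new_local_id):
        """Record that the object moved; its global identifier is unchanged."""
        if gid not in self._location:
            raise KeyError(gid)
        self._location[gid] = (new_component, new_local_id)


registry = GlobalObjectRegistry()
part = registry.register("relational_db", "row-17")
registry.migrate(part, "object_db", "oid-5")
print(part, registry.resolve(part))
```

Applications that hold only the global identifier keep working after the migration, because they never see the local identifier directly.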

In designing a federated database, we should take into account the transformation of heterogeneous local schemas from the native data models of the component database systems into a common data model, the integration of these homogenized schemas into the federated schema, and the derivation of external schemas for global applications.

For resolving conflicts that arise during the transformation or the integration process, the consideration of integrity constraints can be useful, although the process becomes more complicated. For certain kinds of conflicts we have to add new global constraints, but often there is no unique and predictable choice of additional global constraints. The way of resolving such conflicts also depends on the intended semantics of the federated schema and, thereby, on the intended global applications.

6. Data Sharing

In a heterogeneous database environment an ordinary user can face many problems, because the user cannot know all the details of every database that participates in the heterogeneous environment that he/she wants to use. As a result, one of the major goals of a federated system is to support uniform access to the data of the different component databases [Ram 1991]. The problem of uniform and unified access to the data of the component databases is called the problem of data sharing. The literature describes three ways of dealing with the problem of data sharing in heterogeneous databases.

  1. Transfer all the data from one database to another. In this case the “original” database is converted to an equivalent “target” database, which uses a different model to represent the information of the “original” database. This can be done using a special tool called a database converter. What actually happens is that : (1) a copy of the original database is created, (2) the copy is converted to the model of the “target” database, and (3) the final result (the target database) is given to the user.

  2. Create an intermediate schema, and transfer the data from every component database that participates in the federation to the database described by this intermediate schema. In this case a common schema is created, which is a combination of the schemas of all the component databases. Using this schema a new common database is also created, and all the data are transferred to it. The result given to the user is that common database.

  3. Create an intermediate schema, which is a combination of all the schemas of the component databases, and develop appropriate mechanisms for accessing the data directly in the component databases using that schema. This case is similar to the previous one, with the only difference that there is no need to create a common database (or, of course, to transfer any data to it). The data remain in the initial databases, and the user can access them using the mechanisms provided.
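The practical difference between the second and the third approach can be sketched as follows. The example assumes two hypothetical components held as Python lists and an intermediate schema (id, title); the data and names are invented for illustration. Approach 2 copies the translated data into a common database once, while approach 3 translates on every access and therefore sees later local updates.

```python
# Two hypothetical component databases with different local schemas.
component_a = [{"id": 1, "title": "report"}]
component_b = [{"doc": 2, "name": "memo"}]

def common_rows():
    """Translate both components into the intermediate schema (id, title)."""
    for row in component_a:
        yield {"id": row["id"], "title": row["title"]}
    for row in component_b:
        yield {"id": row["doc"], "title": row["name"]}

# Approach 2: transfer all the data into a new common database.
common_database = list(common_rows())

# A local application later adds data to component B.
component_b.append({"doc": 3, "name": "note"})

# The common database is a snapshot; direct access (approach 3) is current.
materialised_ids = [row["id"] for row in common_database]
virtual_ids = [row["id"] for row in common_rows()]
print(materialised_ids, virtual_ids)
```

The snapshot has to be refreshed to pick up local changes, whereas direct access pays the translation cost on every query.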

The main problem in all the above cases is that the original and the target schema might not be equivalent.

7. Object-Oriented Multidatabase Systems

Recently, many researchers have suggested using object-oriented techniques to facilitate building multidatabase systems. Object-oriented techniques have been widely applied in database technology. Although using them in building multidatabases seems promising, the lack of a common methodology impedes further development.

Until now, object technology has influenced the design and implementation of multidatabase systems in the following three dimensions :

8. Existing Heterogeneous Systems

In this section we give a brief description of systems that have been developed or are being developed for either production use or academic purposes.

8.1 Non-object oriented federated databases

Here we review some of the existing (non-object oriented) federated databases. (For each of them we list the category it belongs to, according to the classification of heterogeneous databases given in section 3.)

8.2 Object oriented databases

In this section we review some of the existing object-oriented multidatabase projects.

9. Current issues

Heterogeneous systems are inevitable today. The high investments in software and the value of existing databases make homogeneous database systems, particularly in CAD, very unlikely. Data are typically maintained on heterogeneous software and hardware platforms for historical and technical reasons. The costs incurred and the loss of functionality when transferring the data to a monolithic system rule out a homogeneous solution. It follows that combining distributed heterogeneous systems through a mediator system is the only practical solution.

The mediator system to be developed has to be object-oriented, distributed, and active. It must furthermore provide complete OODBMS properties. Object-oriented models have already proved their flexibility and are well suited to such a mediator system, because they are easily extensible and able to represent objects of arbitrary complexity. The system is furthermore distributed, including its control structures. Work at the Technical University of Darmstadt, Germany, is implementing the integration tool Persistence, an extensible C++ interface to relational databases. The multidatabase testbed uses Informix and Sybase as relational components, with an extended Sybase interface to signal local events to the mediator.

Many other research groups are working in this area. One of the themes of the Rodin Database Group in France is the discovery and retrieval of information in heterogeneous, autonomous data sources over global networks such as the Internet. ConceptBase in Germany is a deductive object manager for meta-databases.

Stanford University in California has the projects “C3”, which studies changes, consistency, and configurations in heterogeneous information systems, and “TSIMMIS”, which studies information source wrapping and mediation across heterogeneous information sources.

In addition to all the above, there are other active projects dealing specifically with federated databases. The Land-Relational Information Systems (LRIS) Network, being developed in Alberta, Canada, demonstrates a practical implementation of the query processing model in a federated database. C-LAB (formerly Cadlab) in Germany has a Database Federation project, in which Efendi, a federated database system, is developed as a software layer on top of multiple heterogeneous, autonomous database and file systems. It was demonstrated at SIGMOD'95 (San Jose, May 1995) and ObjectWorld'95 (Frankfurt, October 1995). The demonstration shows how heterogeneous, autonomous database and file systems are coupled into a federated database system. A software layer on top of them with an ODMG-extended interface allows access to the data of all the data handling systems as if they were a single logical database. It allows data to be selected, updated, and created globally. As the first federated database system to do so, OpenDM/Efendi allows data to be migrated among the data handling systems while preserving their global object identity.

When a component database participates in a multidatabase system, its data model is mapped to a data model that is the same for all participating systems, which is called the Common (Canonical) Data Model (CDM).

CIM stands for Computer Integrated Manufacturing, a manufacturing system that provides direct data sharing among production control systems and the engineering and administrative systems that support them.
On-line applications involve remote updates and have a strict requirement to maintain local site autonomy.