Relational & Conventional Database Software - The Essential Differences (1990)
Bernard W Bennetto - Technology Translation Ltd
The paper considers the nature of the differences between relational and conventional database software and examines the impacts of such differences. It discusses the implications of the dropping of such established conventional features such as the use of hashed and chained data structures and the more widespread use of updatable and multiple index structures. It also examines the value of the more comprehensive and standardised SQL language and the incorporation of a set of integrated data dictionary tables. Other impacts which will be considered include the need for run time optimizers and the need for machine architectures which will exploit the access techniques common among relational technology. Finally, other concepts such as referential integrity rules, micro mainframe links, and views are examined.
In the early seventies, the original database offering was a hierarchical IBM database product called DL/1 or IMS. It had many warts and was criticised quite strongly at the time by database pundits who worked with the other suppliers, during the seventies, to develop and standardise the so called Codasyl, network based approach. IBM resolutely refused to join this austere group, despite the command which the Codasyl options subsequently gained in the market place in the early eighties, and instead chose to remove the warts and continue to work on System R which was based on Codd's relational approach. Eventually, we saw the emergence of both Oracle and DB2 (or SQL/DS) during the late eighties.
In approaching the potential use of relational database products such as DB2, Ingres and Oracle, it is essential to have some appreciation of the way in which they differ from conventional options such as IDMS/X and DL/1. An examination of such basic differences reveals why IBM regarded Codasyl as a blind alley and why it will obsolete by the end of the millennium. It also demonstrates the reasons why the relational approach will overtake other offerings and why, even now, the relational offering and the relational approach present significant advantages which are often unnoticed or understated.
A relational database consists of a collection of tables which correspond to record types in conventional database terminology. These tables are populated with what are termed rows and columns. The rows and columns correspond to records and fields respectively in conventional database terminology. The rows would consist of a set of order item details which would be found in their respective columns - the order number, the product number, the item quantity, the delivery date etc. In these respects the approaches are very similar.
The conventional database technology has always been regarded as the province of the data processing professional. The use of a separate series of terms was adopted because one of the original objectives was that the relational database technology should be accessible by a wider group of end users. To this end there was a need to communicate these concepts to an end user who could intuitively understand what is meant by a table and that such a table consists of a set of columns and rows of data. The end user's intuitive understanding of a record is something which you put on a record player turntable to produce music and that of a field is of a place where cattle and sheep graze.
Relational Data Structures
Apart from the adoption of a separate terminology, a further important respect in which the relational approach is different is that the physical organisation of the data is unimportant. Normally, data is added at the end of the table and the logical presentation is handled at the point when it is accessed. The data structures used are flat, sequential byte addressable files which involve a minimum of instructional overhead to maintain them.
Where the volume of rows is sufficiently small then the table is scanned for rows matching the requested selection criteria and presented in the sorted order of any one of its columns following an in memory sort. However, where the volume of rows is sufficiently large, the presentation of the data is normally supported by permitting each row of data to be uniquely or non-uniquely identified by one or more keys each of which will be implemented as an index. The choice and number of such keys is critical to the performance of the relational database.
Conventional Data Structures
The majority of conventional databases which I have encountered have been built almost entirely on the use hashed and chained structures and have placed little or no reliance on the use of indexed structures. The argument for the use of the hashed and chained structure is that an online application, given say an order number key, can 'hash to' the database page which holds the order details and 'walk' the chain to retrieve details of all the orders items associated with that order with a single physical I/O. This provides a minimal response time for a primary key enquiry on order number.
Despite the fact that the use of such structures is widespread for almost all conventional database systems, the relational approach has dropped this strategy totally and uses sequential flat files with primary key enquiries supported by indexes. Thus both order and order item detail records would be stored sequentially in a flat file structure and the order numbers for each set of records would be held in one or more index structures. This inevitably involves more physical I/Os to navigate the index structure and locate the page where the order and order item records are located.
Multiple Enquiry Structures
In a conventional database design, where there is a need to support a secondary key such as an enquiry on customer number then a chained link will be created between a third customer record and the order records. The enquiry will hash to the customer record with a single I/O but, less efficiently, it must then navigate the chain to locate the order records. Since these have been hashed to a random address the process must perform a physical I/O to retrieve each of these records.
In a relational database design, a further index to order details is created on customer number so as to satisfy this form of enquiry. This also incurs I/Os to navigate the index and then further I/Os to locate the order details which is also no better than the conventional design. Nevertheless, the relational database design does have advantages when both loadability and updatability are considered.
The problem for the conventional design is that in inserting a record into the database, one has to link that record to its owner/parent structures and update the next and prior pointers. This will involve accesses to find the appropriate owner/parent, determine the logical insert point, change pointers on both owners/parents and children and finally insert the order item record. This has to be done for each of the access paths - order number, customer number and product number. This navigation involves considerable instructional and I/O overhead.
The relational design involves merely accesses to maintain the index structures and the insertion. The need to navigate the chain linkages are totally bypassed. Similar improvements will also be found for update processes since they will also avoid the instructional and I/O overheads that are involved in maintaining the chain structures.
These advantages are somewhat diluted if the concept of referential integrity is implemented in a wholesale fashion. This concept refers to the resignation of the responsibility for ensuring that foreign key linkages are maintained intact and in a consistent fashion throughout the whole of the structure. While chains do not have to be walked, uncritical implementation will result in consistency checking being undertaken by the relational database software when it is not necessary.
Mass Storage Occupancy
Relational flat files have advantages in terms of mass storage occupancy since rows are normally added at the end of the file. They also save storage by offering the option of small integer, variable length and null column values. Conventional systems with their predisposition to hashing waste storage. Even the very best hashing algorithms, do not permit population of mass storage with a packing density of more than 70% without an unacceptably high number of our insertions being placed in overflow. Quite often packing densities of 40% are encountered.
As stated above, an objective of the relational approach was that it should be more accessible to the end user. While it has many shortcomings in terms of its end user usability, the Standard Query Language (SQL), commonly pronounced Sequel, has advantages in terms of its acceptance as a standard, its comprehensiveness and independence of data structures,
SQL has now been much more commonly adopted by many different suppliers than the conventional counterparts. Despite the fact that it still demonstrates differing dialects. it is not split so dramatically as the difference between the IBM DL/1 and the Codasyl approaches. It is also more comprehensive in that it offers:
- a DML for manipulating data
- a DDL for defining data
- A DCL for controlling access to the data.
The DML is also non-procedural. It can be used interactively by an end user as well as by embedding it into online and batch programs. It requires no knowledge of the underlying structure of the database in the way the hierarchical and Codasyl approaches require. The manipulation commands of the conventional approach can only be used within the context of a program and the programmer is restricted to certain manipulation commands for hash structures and others for chain structures.
The non-procedural aspect of the relational approach is made possible by the incorporation of a set of integrated data dictionary type or 'catalog tables'. These are built up through the use of the DDL and the DCL. The catalog tables consist of:
- logical tables which hold lists of the tables, keys, indexes, columns, rows which comprise the database
- physical tables which identify the volumes, files, index files on which the logical tables will be found
- access control tables which identify users and their authority to access particular rows and columns within particular tables ie. their privileges.
Unlike conventional databases which translate such definitions into binary representations, these tables are held in character format so that they can be accessed and displayed by any one holding the necessary privileges. They are an in-built data dictionary.
Run Time Optimizers
The non-procedural nature of the DML means that these tables are accessed at run time to establish whether the logical tables and columns exist, the location of the physical media on which they exist and whether the user has the privileges to access that data. A further impact is that there is a need for what is termed a 'run time optimizer' which examines what indexes exist to meet that particular access requirement and to build a 'strategy' for meeting this requirement in the most optimal fashion.
It is not necessary to invoke this overhead on each and every occasion that this access requirement is called. The data dictionary of system tables also contains a library of the available processing strategies so as to shortcut the initial overhead. They must however be re-worked whenever a relevant change in the data structure takes place as a result of data re-definition.
The use of thtese tables does causes relational database software to be considerably more processor, memory and I/O intensive than conventional database software such as DL/1 and IDMS. It does involves both extra instructions and I/O in determining the optimal access paths and in navigating both the indexes and catalog structures. This is the cost of both data structure and procedural flexibility.
Nevertheless, it is prompting improvements in machine and software architectures which will ameliorate the impacts of some of these overheads. The availability of much cheaper memory and the ability to address a much larger physical and virtual memory space means that the catalogs can be permanently located into memory. Frequently used indexes can also be placed in such storage and non-resident table and index scans are ideally suited to the CAFS approach. The additional processor overheads are handled by processors which are now capable of handling more than 100 million instructions per second.
The buffer management mechanisms of the conventional database offerings are antiquated and archaic and much more appropriate to the lack of large amounts of main memory. Page sizes of the relational offerings usually offer a big (say 32k) page option and a small (say 4k) page option. They also offer a much larger number of buffers with the ability to lock certain pages into a buffer. Moreover, they also offer several categories of buffer so that you could have separate buffers for the system tables, the indexes and the tables themselves and avoid any undue, nugatory interaction between them.
The technology which has been adopted to support the implementation of indexing is much more advanced than that in use by conventional offerings. Since indexing does not play a major role in many conventional designs, many conventional counterparts require frequent re-construction of their indexes and some do not even permit the updating of indexes! The so called B tree approach which is used by Oracle and DB2 offers a much more maintainable approach. Both offer the option of partitioning indexes so as to limit the number of index levels and clustered indexes which ensures that rows are placed in the table space in the order in which they appear in the index.
Additionally, relational database software is more oriented to the end user since each end user is permitted to have his own 'view' of the data. This might consist of:
- a subset of all available rows ie. the invoicing section would deal with completed items and the progress control section would deal with outstanding order items
- a subset of the columns so that the view of the invoicing section would only include the delivered quantities column and exclude the quantity outstanding column
- presentation according to different logical criteria so that the progress control section will view their data in age of order criteria.
Such views carry a degree of performance overhead since queries against them cannot be pre-defined - they must be re-optimized on each occasion they are used.
Therefore, a serious cause for concern whenever one is involved in the design of a relational database is the possibility that a number of users will simultaneously initiate enquiries to create views which will result in a full table scans or a merge (join) of a number of tables. This can now be controlled by such strategies as a systematic analysis of all their potential requirements, aborting such jobs after they have consumed more than a pre-determined budget of CPU time and disk time and by building stand alone structures. A further approach is to make use of the view concept and transfer a limited amount of data for local access and manipulation via micro mainframe links. MIcro architectures can now support the micro implementation of such products as Oracle and Ingres and there is a large degree of correspondence between spreadsheets and the table concept. These serve to ameliorate the problems raised by the view.
Relational database technologies attempt to bring database access much closer to the ultimate end user. As such they are intutitively understandable and, where volumes are small, require minimal intervention from the data processing professional. Where volumes are large, the performance overheads are not as significant as those which have been previously implied. Furthermore, the advances in hardware and software architectures are serving to further ameliorate these differences.
This paper was presented at the AMSU conference held in London in May 1990
© Copyright rests with Bernard W Bennetto and Technology Translation Limited.
Last updated: 11/04/97 19:55:07
Authored and Designed by:
Bernard W Bennetto email@example.com
© Bernard W Bennetto 1996 - 1997