Relational Database Software - Exposing Performance Fallacies (1989)
Bernard W Bennetto - Technology Translation Ltd
There is a widespread belief that the use of relational database software is not appropriate when performance is a major consideration. This paper attempts to demonstrate that this is not necessarily the case. Relational database software will be analysed so as to demonstrate certain advantages such as on-line update response, loadability, mass storage occupancy and sequential retrieval and updating. The narrowness in the use of the term 'performance' will also be examined by a consideration of other criteria such as flexibility, end user accessibility, privacy/security and software/hardware compatibility. Examples will be drawn from DB2.
Relational database software IS considerably more processor, memory and I/O intensive than conventional database software such as DL/1 and IDMS. It involves both extra instructions and I/O in determining the optimal access paths and in navigating both the indexes and catalog structures. As such, it is not intuitively considered for applications where performance is a major consideration.
Nevertheless, despite these additional overheads, as both a database design and performance management consultant I have experience of several instances where relational database software demonstrates certain advantages over its conventional counterparts. These advantages derive from the fact that relational database software uses data structures which are more akin to flat files than to the variety of indexed, hashed and chained structures which might be involved in locating the owners/parents and their parent/child or set constructions (see Figure 1).
This realisation came while I was leading a database administration team involved in the initial designs of a Codasyl database for the Dutch Vehicle Registration Centre. This database needed to hold the details of the eight million or so registered vehicles in Holland and needed to provide online access to these details by three access paths - vehicle registration mark (VRM), chassis number and registered keeper (see Figure 2).
The first design for this structure was subjected to a detailed performance analysis and estimates were prepared of the response times for online processes and the elapsed times for the batch and load processes. While online and batch timings were acceptable, the time for the load was somewhat greater than 30 days on dedicated equipment! This was totally unacceptable.
The source of the problem was that in inserting a record into the database, we had to link that record to its owner/parent structures and update the next and prior pointers. This involved accesses to find the appropriate owner/parent, determine the logical insert point, change pointers on both owners/parents and children and finally insert the vehicle record. This had to be done for three access paths - chassis number, VRM and registered keeper. This navigation involved considerable instructional and I/O overhead.
The answer to our dilemma came from the senior member of the design team who remembered a seminar that had been conducted by the design staff of the Trustee Savings Bank. During the course of that seminar they had expressed certain concerns over the use of chain structures. He suggested, therefore, that we should consider the adoption of a relational approach with flat file structures and with foreign key linkages for two of our access paths (see Figure 3).
This involved merely two sets of accesses to the two foreign structures, two modifications and an insertion. We re-ran the performance analysis programs and the load time was estimated to be three days - a dramatic improvement.
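The difference in work per insertion can be sketched with a toy I/O-counting model. The per-step costs below are my own illustrative assumptions, not measurements from the Dutch project or from any actual Codasyl or DB2 implementation; they serve only to show the shape of the difference between the two designs.

```python
# Illustrative model: count the I/Os needed to insert one vehicle record
# under each design. All per-step costs are assumptions for illustration.

def chained_insert_ios(access_paths=3):
    """Chained (Codasyl-style) design: link the record into owner/parent
    chains for each access path (VRM, chassis number, registered keeper)."""
    ios = 0
    for _ in range(access_paths):
        ios += 2   # locate the appropriate owner/parent record
        ios += 1   # walk the chain to find the logical insert point
        ios += 2   # rewrite next/prior pointers on owner/parent and child
    ios += 1       # finally insert the vehicle record itself
    return ios

def foreign_key_insert_ios(foreign_structures=2):
    """Flat-file design: two sets of accesses to the two foreign
    structures, two modifications and an insertion."""
    ios = 0
    for _ in range(foreign_structures):
        ios += 1   # access the foreign structure
        ios += 1   # modify it
    ios += 1       # insert the flat vehicle record
    return ios

print(chained_insert_ios())      # 16 I/Os in this model
print(foreign_key_insert_ios())  # 5 I/Os in this model
```

Even with these crude assumptions, the flat-file insertion does roughly a third of the work per record, which is consistent with the order-of-magnitude load-time improvement described above.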
We also noticed other improvements. The first was that the online update processes also improved since they used a similar access strategy to the one that the load program was using - they avoided the instructional and I/O overheads that were involved in maintaining the chain structures.
Mass Storage Occupancy
The use of flat files also has advantages in terms of mass storage occupancy. Even with the very best hashing algorithms, thoroughly researched from the current literature, we found that we could not populate our mass storage at a packing density of more than 70% without an unacceptably high number of our insertions being placed in overflow. Nor did it make sense to use hashed structures when new vehicle registrations occurred at the end of the filestore. We therefore decided to limit our use of hashed structures, and the amount of mass storage we required fell dramatically - by approximately 20%.
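The relationship between packing density and overflow can be demonstrated with a small simulation. The page and slot counts below are arbitrary assumptions, and a uniform random page choice stands in for a real hashing algorithm, but the qualitative behaviour - overflow rising sharply as packing density approaches 100% - is the same.

```python
import random

def overflow_fraction(n_pages=1000, slots_per_page=10, packing=0.7, seed=1):
    """Hash `packing * capacity` records into fixed-size pages and return
    the fraction that arrive at an already-full page (i.e. go to overflow).
    A uniform random page choice stands in for a real hash function."""
    random.seed(seed)
    fill = [0] * n_pages
    records = int(n_pages * slots_per_page * packing)
    overflow = 0
    for _ in range(records):
        page = random.randrange(n_pages)
        if fill[page] < slots_per_page:
            fill[page] += 1
        else:
            overflow += 1
    return overflow / records

print(round(overflow_fraction(packing=0.7), 3))
print(round(overflow_fraction(packing=0.9), 3))
```

At 70% packing only a few per cent of insertions overflow in this model, but at 90% the fraction grows several-fold - which is why the hashed design forced us to leave roughly 30% of the storage empty.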
The DB2 relational database software uses a flat file VSAM ESDS approach and therefore makes more economical use of mass storage than conventional database software, which predisposes one towards a hashed file approach (for reasons which I cannot wholly determine). It also saves storage by offering the option of small integer, variable length and null column values which, while generating performance considerations that cannot be examined here, nevertheless mean that less mass storage is consumed.
At the Dutch Vehicle Registration Centre we found that the batch processes also showed improvement. Using sequential flat files, they traversed a smaller number of well populated pages rather than the much larger number of more sparsely populated pages which a hashed structure would have involved (see Figure 4).
Moreover, in certain cases, they did not have to concern themselves with the instructional overheads of navigating a chain parent/child type structure. In addition, certain of the key batch processes also received their records in a sorted logical order as opposed to the random order which a hashed structure would have imposed.
DB2 offers batch processes the opportunity to carry out a full tablespace scan of well packed pages and a variety of index based approaches to scanning the tablespace. Like a VSAM KSDS, it offers the opportunity to interrogate index key values to determine whether a tablespace page needs to be accessed at all, and 'clustered' indexes so that the pages are read in an ESDS fashion.
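The idea of answering a query from index pages alone can be illustrated with SQLite standing in for DB2 (the table and column names are hypothetical, and SQLite's plan output differs from DB2's EXPLAIN, but the principle is the same): when every column a query needs is carried in the index, the underlying table pages need not be accessed at all.

```python
import sqlite3

# SQLite stands in for DB2 here purely to illustrate index-only access.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE vehicle (vrm TEXT, chassis_no TEXT, keeper TEXT)")
con.execute("CREATE INDEX ix_vrm ON vehicle (vrm, chassis_no)")

# Both the search key (vrm) and the result column (chassis_no) are in the
# index, so the query can be satisfied without reading any table page.
plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT chassis_no FROM vehicle WHERE vrm = ?",
    ("AB12CDE",),
).fetchall()
for row in plan:
    print(row[3])  # SQLite describes this as a COVERING INDEX search
```

A clustered index adds the complementary benefit for range scans: because the data pages are held in index key order, a batch process reads them sequentially, in an ESDS fashion, rather than revisiting pages at random.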
This fact has been reinforced by visits to other sites where they had difficulties in meeting their overnight processing schedules because of such considerations. The batch processes walk more pages and conduct more sorts than they need because of the use of hashed and chained structures. In a banking environment, where the overnight schedule is critical, such performance considerations might be far more important than an online transaction responding in two rather than three seconds - the typical gain you might get from using a hashing approach rather than one based on an index.
At the Vehicle Registration Centre in Holland, we were not only concerned with the fact that the database would take thirty days to load, but that it would also take thirty days to RELOAD. No member of the design team was confident that we would get it right first time, and we realised that any change might involve both reloading the database and re-programming the applications. Conventional database software does NOT have the necessary degree of data independence or tuning capability (despite the claims) to allow mistakes to be corrected without substantial effort.
The fact that a re-load could be carried out within three days generated the confidence that new structures could be added within a bank holiday weekend. The 'relational database approach' was considerably more flexible, even though it remained difficult to amend program structures, since programs navigated their paths 'self-consciously': a program needed to be aware of whether it was walking a hashed, indexed or chained structure.
This lesson has been reinforced by subsequent experience. I have been asked to audit a number of Codasyl and hierarchical database designs at a point where programs were written and the application was about to be implemented. Unfortunately, there was little that could be achieved since they do not lend themselves to re-organisation or tuning without major re-runs and re-programming - the cost of which would have been greater than the equivalent upgrade.
DB2 achieves a far greater degree of what is termed data independence. Its VSAM ESDS flat file approach and use of the catalog structures allows one to load and reload structures with almost equivalent gains to those achieved in the Dutch project. It also allows one to drop and create indices at will with potential gains for both update processes and sequential processing. Moreover, because SQL is non-procedural, programs can be re-prepared and re-bound with the amended data structures. Inefficient, embedded SQL structures can also be much more easily replaced by more efficient structures.
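The point about dropping and creating indices at will can be made concrete with a small sketch (again using SQLite as a stand-in for DB2, with hypothetical table and column names): the index is purely a tuning artefact, so it can be created or dropped without touching a single line of the application's SQL, and the results are unchanged.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE vehicle (vrm TEXT PRIMARY KEY, keeper TEXT)")
con.executemany("INSERT INTO vehicle VALUES (?, ?)",
                [("V1", "Jansen"), ("V2", "de Vries"), ("V3", "Jansen")])

# The application's SQL: non-procedural, with no knowledge of access paths.
query = "SELECT vrm FROM vehicle WHERE keeper = ? ORDER BY vrm"

before = con.execute(query, ("Jansen",)).fetchall()
con.execute("CREATE INDEX ix_keeper ON vehicle (keeper)")  # tuning change only
after = con.execute(query, ("Jansen",)).fetchall()
con.execute("DROP INDEX ix_keeper")                        # and back again

assert before == after  # the SQL, and hence the program, never changed
print(before)
```

In DB2 the equivalent step after such a change is simply to re-prepare and re-bind the affected plans; the embedded SQL itself is untouched, which is precisely the data independence that a self-consciously navigating Codasyl program cannot offer.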
There are numerous features within relational database software which recognise the I/O, memory and processor overheads and which provide the designer or tuner with tools to counteract these overheads. In DB2 these include:
- an EXPLAIN facility which documents the results of the optimizer and bind process (even though it does not unlock the secrets of the Optimizer itself)
- an overflow organisation which results in the location of the misplaced or overflowing row in a single I/O and means to determine the extent of overflow displacement
- the use of a space map which can be used to report on space usage and which also tells DB2 whether there is sufficient space on the page for a row insertion - therefore avoiding the need for accessing that data page
- a mechanism whereby an index page can be locked in quarters or sixteenths and which is therefore not wholly locked
- a four part buffer management structure which enables one to isolate the potentially conflicting buffering requirements of different catalog, table and index spaces
- the opportunity to nominate an indexspace or tablespace as a linear data set (LDS) and locate it permanently in memory
- a track record of consistent performance improvements with each new release of the software
- the development of DB2 as a strategic product so that it can fully exploit the benefits of Extended Architecture (XA) and hierarchical storage mechanisms.
There seems to be some mystical belief that because conventional database software is more efficient in terms of processor, memory and I/O usage, there is little need to consider its performance implications. Database designers bow before the god of normalisation and avoid any hint of data redundancy. They use the myriad access structures of conventional database software to 'cook up' what I would refer to as complicated 'bowls of spaghetti' - the best description of the pictorial representation contained in the Bachman diagrams of their resultant designs. Such structures often result in online transactions and batch processes which incur twenty, thirty or forty I/Os per work cycle - a figure which I consider totally unacceptable, and one greater than the values which might be achieved by relational database software through sensible database design. Forty I/Os would be equivalent to accessing ten rows through ten separate four level indexes!
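The closing arithmetic is worth spelling out, on the counting convention used above (four page reads per row retrieved through its own four-level index):

```python
# Worked version of the figure quoted above: ten rows, each fetched
# through its own separate four-level index, at four page I/Os per row.
index_levels = 4
rows = 10
total_ios = rows * index_levels
print(total_ios)  # 40
```

In other words, the 'unacceptable' I/O budget of a spaghetti design buys, under a relational design, ten fully indexed random row retrievals.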
If these designers have a performance objective at all, it seems to be that their online processes should respond within the minimum achievable response time - the normalisation goal notwithstanding. They neglect the batch implications and other performance considerations.
A database designer who is working with relational database software is more likely to accept the software overheads and is more inclined to denormalisation and the use of what is best termed 'controlled redundancy' - the periodic creation of freestanding structures for navigation by particular processes. The teachers of courses on conventional database design appear to be overemphasising data analysis, normalisation and the avoidance of redundancy at the expense of other 'performance' considerations.
Definition of Database Design Performance Objectives
The word 'performance' has been placed in inverted commas so as to draw attention to the need for a wider definition of database design performance than has previously been considered. Codd's normalisation rules should be regarded as a data analysis tool and not a design objective. There is NO need to implement wholly normalised structures if future processing does not require it and there IS a need to create 'controlled' redundant structures if current processing dictates it.
Using the Dutch Vehicle Registration as a further example, while our data analysis had indicated the existence of insurance company, insurance branch office and insurance policy entities there was no need to create separate tables for these entities. Provision was merely made for holding the insurance policy number as a foreign key in the vehicle record. There was also a need to create separate redundant mini-structures for processing such as partial VRM, make and body colour searches which involved large or full tablespace scans.
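The insurance decision can be sketched as a schema fragment (SQLite again stands in, and the table and column names are hypothetical reconstructions, not the actual Dutch schema): the normalised entities are collapsed away, and the vehicle row simply carries the policy number as a foreign key value.

```python
import sqlite3

# 'Controlled redundancy' sketch: no insurance company, branch office or
# policy tables are created, even though data analysis identified those
# entities; the vehicle row just holds the policy number in-row.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE vehicle (
        vrm        TEXT PRIMARY KEY,
        chassis_no TEXT,
        keeper     TEXT,
        policy_no  TEXT    -- foreign key value only; no referenced table
    )
""")
con.execute("INSERT INTO vehicle VALUES ('AB12', 'CH9', 'Jansen', 'POL-77')")
row = con.execute(
    "SELECT policy_no FROM vehicle WHERE vrm = 'AB12'"
).fetchone()
print(row[0])  # POL-77
```

Should a future requirement demand full policy details, a policy table can be added and joined on `policy_no` without disturbing the vehicle rows already loaded - which is exactly why carrying the key, rather than normalising eagerly, was the pragmatic choice.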
The definition of performance should also be widened to extend beyond the goals of meeting online objectives and normalisation for future flexibility. It should include the capability of meeting batch objectives, re-organisability, availability, programmability, maintainability and end user accessibility. It should also be widened so that it includes more general objectives such as the use of integrated privacy and security, data definition capabilities and portability across hardware - features which are increasingly defined within standards.
I do not wish to conclude by over-emphasising the benefits and advantages of the relational database software approach. Firstly, many of these design principles can be achieved using conventional database software (and in fact were, because the design of the Dutch Vehicle Registration database took place in 1984 at a time prior to the first shipments of the pre-release version of DB2). I also see that particular difficulties will arise in determining how a site establishes a testing environment for DB2, in exploiting and implementing referential integrity features and in formulating standards for the use of views.
Nevertheless, all is not well in the world of conventional database design. Given that satisfactory performance objectives can be laid down for a relational database design and that the design exercise can be approached pragmatically, there is no reason why relational database software should not be used for applications requiring substantial peak online arrival rates. While the sustainable rate may not be as high as the suppliers claim, there is less of a discrepancy than one might think.
This paper was first presented at UKCMG 1989 - the annual conference of the United Kingdom Computer Measurement Group - at Glasgow in May 1989.
© Copyright rests with Bernard W Bennetto and Technology Translation Limited.