Denormalisation - An `Optimal Database Design' Tactic (1993)

Bernard W Bennetto and Matthew C O Todd

Abstract

Many IT installations have implemented databases designed according to principles which give insufficient weight to the ultimate cost of operation relative to other design criteria; as a design criterion, cost of operation receives scant consideration.

Codd's theory, through the concept of a fully normalised data structure, regards the maximisation of future flexibility as the principal objective in the design of a database. The systems development methodology, SSADM, reinforces this view by regarding the `normalised' structure as the `default' design structure. It regards the need to meet other client objectives such as fast on-line response and high levels of availability as secondary considerations and modifies the default structure in the light of such considerations.

This default structure, however, is also the most costly since each relation is mapped to a table, resulting in the definition of the maximum number of tables. A SQL command can then only be satisfied by accessing data from a potentially larger set of tables: the additional logical accesses increase the amount of CPU activity and the additional physical accesses increase the amount of I/O activity.

Designs involving denormalisation consider the impacts of systematically limiting this degree of future flexibility relative to the cost implications by `consolidating' two or more relations into a single table. They provide the client with grounds for clearly limiting the need for future flexibility. Such `consolidation techniques' provide one tactic for achieving `Optimal Database Designs' (OD²).

OD² is a programme for the systematisation of techniques and tools for comprehensively defining database design criteria, the automatic evaluation of the performance of designs against such criteria and the adoption of adequate trade-off schemes.

Introduction

Practical experience leads us to the strong opinion that there is a need for better database designs, since poor designs are still prevalent despite the widespread adoption of System Development Methodologies, Structured Design Concepts and CASE Tools. Design guidelines for current approaches emphasise such design criteria as full normalisation of the data structures while neglecting other important criteria.

Personal involvement in design reviews has revealed failures to consider such critical design criteria as:

  • a batch processing support capability - the need to determine the critical path of the overnight batch processes and to fit the total processing cycle into the overnight batch processing window.
  • re-organisability - ignoring the fact that a bank holiday weekend provides only 80 hours processing time for re-organisation and/or re-structuring of the database.
  • loadability - failing to predict that an initial design would take 33 days to load while an alternative design would take a tenth of that time.

Moreover there are further grounds for holding such a view.

  • An informal survey amongst UKCMG attendees at a previous conference revealed that on-line processes can often generate between 200 and 12000(!) physical I/Os as against a normal budget of 20.
  • The identification of a spend of £15000 on the development of a transaction which was deemed non-critical and would be run once per year.
  • Machine upgrades prompted and major systems modified as a result of inadequate initial designs.

Logical Database Design

Data base design is generally agreed to consist of a logical design stage followed by a more DBMS specific physical design stage [4.]. While there are two approaches that can be taken to the task of logical database design, normalisation and entity relationship modelling, there is an absence of formalised approaches to physical design.

Normalisation is a technique introduced [1. and 2.] as an extension to the relational model [3.]. It is a logical database design technique based on the application of progressive tests for fully associating attribute types with their appropriate relations. Full normalisation is a stopping rule which reduces a relation to its simplest and most atomic form by checking the correctness of a model which has been developed by more intuitive means.

The relations which are derived from normalisation are analogous to the entities which result from entity relationship modelling or semantic modelling [5.]. However, while this latter technique has the advantage of providing a progressive top-down stepwise approach to the definition of candidate entities or relations, it does not have the scientific or mathematical respectability of normalisation.

The SSADM Methodology

The two approaches have been adopted by SSADM [6.], where they are regarded as complementary. Entity relationship modelling has a more significant role during the early phases of the system life cycle and normalisation during the later phases, where it is used to confirm and refine the data model produced so far.

However, SSADM regards the fully normalised data structures not only as a default logical database design structure but as a potential default physical structure since it regards the maximisation of future flexibility as the principal objective in the design of a database. The methodology stresses the need and desirability of mapping the logical to the physical on a one-to-one basis (each entity or relation becomes a record or table and each relationship or linkage becomes a set/chain or foreign key linkage) and to deviate from the logical structure ONLY WHEN:

  • the logical structure cannot be mapped to an appropriate physical database structure as might be the case if one were mapping to a hierarchical database such as DL/1
  • the performance goals which have been set by the end user clients are unlikely to be met and those end user clients are unwilling to agree to a change in the original performance goals.

Since most logical structures can now be mapped one-to-one directly to a network or relational database and since the criteria for defining performance continue to be viewed in a singularly narrow fashion, the goal of implementing a fully normalised structure is the default target.

Consequences

This default structure, however, is also the most costly since each relation is mapped to a table, resulting in the definition of the maximum number of tables. A SQL command can then only be satisfied by accessing data from a potentially larger set of tables: the additional logical accesses increase the amount of CPU activity and the additional physical accesses increase the amount of I/O activity.

Therefore, many IT installations have implemented databases designed according to principles which give overwhelming weight to future needs for flexibility and insufficient weight to the ultimate cost of operation relative to other design criteria; as a design criterion, cost of operation receives scant consideration.

The situation is aggravated by the absence of techniques for defining performance evaluation criteria and for mapping from a logical to a physical design with a conscious awareness of the impact of design choices on those criteria.

Denormalisation

Designs involving denormalisation consider the impacts of systematically limiting this degree of future flexibility relative to the cost implications by consolidating two or more relations into a single table. In fact the technique, which is a physical design technique, is more appropriately called `Consolidation'. It provides the designer with grounds for clearly defining and limiting the end user client's need for future flexibility. It is defined as a physical design technique because the authors do not wish to imply that normalisation as a logical design tool is inappropriate. It is a technique which is self-consciously applied following the completion of logical design and the establishment of a normalised structure.

Using examples from a vehicle registration design exercise, the discussion below considers each of the three main normalisation tests and demonstrates how the application of Consolidation at the physical design stage might alter the resultant physical structure.

The Normalisation Technique

There are normally three tests. The application of these various tests transforms the structure into what is referred to by Codd as Third Normal Form. In such a state, the structure will offer what might be regarded as the most robust option. It has the flexibility to cope with all the processing requirements which are likely to use that defined set of data.

1NF - Repeating Group Test

This test establishes whether there is a variable number of occurrences of any attribute or group of attributes in any one occurrence of a relation. This is demonstrated by considering the following list of attributes for the vehicle relation, which is keyed on the Vehicle Registration Mark:

Vehicle Registration Mark (Key)
Vehicle Type
Vehicle Make Code
Vehicle Make Name
Current Vehicle Owner Number
Current Vehicle Owner Name
Current Vehicle Owner Address
Current Vehicle Date of Registration
Previous Vehicle Owner Number
Previous Vehicle Owner Name
Previous Vehicle Owner Address
Previous Vehicle Date of Registration

The test requires that there is one (and only one) set of values associated with the vehicle, and this list indicates that the name, address and date of registration are repeated for the current and previous owner. This situation must be avoided by the definition of two relations with the following lists of attributes:

Vehicle Registration Mark (Key)      
Vehicle Type
Vehicle Make Code
Vehicle Make Name


and

Vehicle Registration Mark (Partial Key 1)
Vehicle Owner Number (Partial Key 2)
Vehicle Owner Name
Vehicle Owner Address
Vehicle Date of Registration.

This structure has passed the first test and, intuitively, it can be seen that this structure is more robust since it permits an infinite number of ownerships to be recorded (rather than the previous two).

Consolidation considers those circumstances where this test might be subsequently overturned during physical design. Normally, the potential alternative structure which can be created is a single table with one or more repeated columns, where the nature of the linkage is one-to-one or where there is a finite, limited number of occurrences of the linkage.

For example, if the end user accepts that there is no need to record more than two ownerships for a vehicle, we see good reason for considering overturning this test during the physical design. In fact, the end user's logical view might well regard the two ownerships as two separate sets of attributes - current and previous ownership details. Additionally, the end user might not consider it desirable to retain an unlimited ownership history.
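As a minimal sketch of such a consolidated physical design (the SQL and all table and column names below are illustrative assumptions rather than the actual DVLC definitions), both ownerships are carried as fixed columns of a single vehicle table:

    -- consolidated vehicle table: current and previous ownerships held as fixed columns
    CREATE TABLE VEHICLE
        (VEHICLE_REG_MARK    CHAR(7)   NOT NULL,
         VEHICLE_TYPE        CHAR(2),
         VEHICLE_MAKE_CODE   CHAR(4),
         VEHICLE_MAKE_NAME   CHAR(20),
         CURR_OWNER_NO       INTEGER,
         CURR_OWNER_NAME     CHAR(30),
         CURR_OWNER_ADDRESS  CHAR(60),
         CURR_REG_DATE       DATE,
         PREV_OWNER_NO       INTEGER,
         PREV_OWNER_NAME     CHAR(30),
         PREV_OWNER_ADDRESS  CHAR(60),
         PREV_REG_DATE       DATE,
         PRIMARY KEY (VEHICLE_REG_MARK));

A single row then satisfies an enquiry on either ownership, at the price of never being able to record a third.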

Normalisation would dictate that the following logical structure would be implemented by this physical structure.

[Figure: 1NF - the logical structure and its physical implementation]


Thus normalisation would dictate that the unnormalised table

Vehicle Registration Mark (Key)
Vehicle Type
Vehicle Make Code
Vehicle Make Name
Vehicle Colour 1
Vehicle Colour 2
Vehicle Colour 3


would be implemented as the two normalised tables

Vehicle Registration Mark (Key)
Vehicle Type
Vehicle Make Code
Vehicle Make Name


and

Vehicle Registration Mark (Key)
Vehicle Colour Code (Key)

However, if the end user defines and accepts that only two colours (a primary colour and a secondary colour) will be recorded per vehicle, one sees further good reason for considering overturning this test during the physical design.
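To illustrate the processing difference (again with illustrative table and column names), an enquiry which retrieves a vehicle and its colours from the normalised pair of tables requires a join, whereas the consolidated table answers it with a single-row fetch:

    -- normalised pair: the enquiry requires a join and one access per colour row
    SELECT V.VEHICLE_REG_MARK, V.VEHICLE_TYPE, C.VEHICLE_COLOUR_CODE
    FROM   VEHICLE V, VEHICLE_COLOUR C
    WHERE  C.VEHICLE_REG_MARK = V.VEHICLE_REG_MARK
    AND    V.VEHICLE_REG_MARK = 'A123BCD';

    -- consolidated table: the same enquiry is answered from a single row
    SELECT VEHICLE_REG_MARK, VEHICLE_TYPE,
           PRIMARY_COLOUR_CODE, SECONDARY_COLOUR_CODE
    FROM   VEHICLE
    WHERE  VEHICLE_REG_MARK = 'A123BCD';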

In addition to a consideration of the flexibility criteria which are relevant to whether one does or does not adopt such structural options, one should also consider other relevant performance and design issues. These would include:

  • the volume of traffic between such tables
  • additional SQL definition and programming difficulties introduced by using such structures
  • whether the tables can exist independently of each other
  • the number and volatility of the extensions that might need to be created (since this is likely to affect space planning considerations)
  • special security considerations which apply to the normalised attributes.

2NF - Partial Key Dependence Test

In our previous example a new relation was created which had an identifying key of Vehicle Registration Mark and Vehicle Owner Number. The second test establishes whether it is possible to determine the value of any attribute in a relation if only part of the relation key is known.

Vehicle Registration Mark (Key)
Vehicle Owner Number (Key)
Vehicle Owner Name
Vehicle Owner Address
Vehicle Date of Registration

While we require both the vehicle registration mark and the owner number to establish the value of the date of registration, the values of both owner name and address can be determined solely from the owner number key. To avoid this situation this relation should be split into:

Vehicle Registration Mark (Key)
Vehicle Owner Number (Key)
Vehicle Date of Registration


and

Vehicle Owner Number (Key)  
Vehicle Owner Name
Vehicle Owner Address.


Again, intuitively we can see that this structure is more robust since it avoids the repetition and replication of owner name and address details in every ownership occurrence.

Whether such a logical structure can be implemented as a physical structure will depend upon whether an owner key exists. In the case of the Driver and Vehicle Licensing Authority at Swansea this is impractical, since no such owner key exists, despite the fact that one might be deemed desirable and would result in a more robust structure. Therefore, the resultant physical structure involves carrying the owner attribute details repetitively within the ownership relation. This introduces redundancy, but the redundancy can be controlled and may be much easier to cope with than the alternative problems raised by the definition and invention of a unique owner key.
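A sketch of the resultant ownership table, with the owner details carried redundantly on every ownership row (the column names and the use of a sequence number in place of a true owner key are illustrative assumptions, not the DVLC implementation):

    -- ownership relation with owner details repeated because no owner key exists
    CREATE TABLE VEHICLE_OWNERSHIP
        (VEHICLE_REG_MARK  CHAR(7)   NOT NULL,
         OWNER_SEQ_NO      SMALLINT  NOT NULL,   -- ordinal within the ownership history
         OWNER_NAME        CHAR(30),
         OWNER_ADDRESS     CHAR(60),
         REG_DATE          DATE,
         PRIMARY KEY (VEHICLE_REG_MARK, OWNER_SEQ_NO));

The same owner's name and address will therefore recur on every ownership row relating to that owner.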

[Figure: 2NF - the ownership structure with and without an owner key]

In addition to the feasibility of introducing a unique key, among the criteria which might be relevant to this sort of decision are the volume of traffic between relations, the extra processing that might be needed to cope with the control of the repetitive data and the extra mass storage that is needed because of the redundancy introduced by the repeating data.

3NF - Non-key Dependence Test

This final test establishes whether it is possible to determine the value of any attribute in a relation if the value of some other non-key attribute is known. To return to our vehicle relation, which had been transformed following the first test, the following attributes are present.

Vehicle Registration Mark (Key)
Vehicle Type
Vehicle Make Code
Vehicle Make Name

This structure contains a multiple representation of make code and make name, which will be replicated in every single occurrence of vehicle. Since the make name can be determined from the make code and is obviously dependent upon that make code, we should avoid this situation by the creation of a further relation. The two relations will be:

Vehicle Registration Mark (Key)
Vehicle Type
Vehicle Make Code


and

Vehicle Make Code (Key)
Vehicle Make Name

The application of this test makes the structure more robust since it is now possible to record data about vehicle makes without the prior existence of a vehicle occurrence.

It is unlikely, however, that the analyst will encounter a situation where the make `code' is already present in the list of attributes, so the strict application of this test is impossible. The analyst should therefore be prepared to seek out situations where an attribute is regarded as significant in itself and is determined by attributes which have not yet been encountered, but whose presence would transform it into a relation in its own right.

There is an inherent danger that the data analyst will invent untold relations in pursuing the goal of third normal form. In fact, in a data base design exercise for a health insurance company, it was found that the data analysis had identified approximately 100 relations and that only 30 were implementable! Many were such obvious attributes as region and place names and the attributes had been transformed into relations by the invention of a region or place code. A goal of Consolidation would be to identify such `pseudo-relations' and to prevent their incorporation in the physical design.
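For instance (an illustrative sketch rather than the health insurance schema itself), a REGION pseudo-relation invented purely to give a region name a key can simply be folded back into its parent table during physical design:

    -- pseudo-relation produced by the strict pursuit of third normal form
    CREATE TABLE REGION
        (REGION_CODE  CHAR(3)  NOT NULL,
         REGION_NAME  CHAR(20),
         PRIMARY KEY (REGION_CODE));

    -- consolidated physical design: the region name is retained as a plain column of its parent
    CREATE TABLE BRANCH_OFFICE
        (OFFICE_NO    SMALLINT  NOT NULL,
         OFFICE_NAME  CHAR(20),
         REGION_NAME  CHAR(20),   -- kept as an attribute, not as a table in its own right
         PRIMARY KEY (OFFICE_NO));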

Attribute Consolidation

Situations can also occur in a physical design exercise where it is appropriate to consider whether one or more of the attributes of a neighbouring relation should be included or duplicated in the target `relation'. So as to identify attributes which are candidates for such consolidation, one should have detailed data on the attribute usage by processes so that one is able to determine which attributes are used together most frequently.

Consider the following situation, in which we are attempting to establish whether attributes of `MOT Certification' and `Vehicle Make/Model' should be included in the `Vehicle' relation and where the relative volumes of traffic between the relations are as indicated.

[Figure: Traffic - relative volumes of traffic between the Vehicle, MOT Certification and Vehicle Make/Model relations]

In this instance we see a considerable volume of traffic travelling from the vehicle relation to both neighbouring relations, and there could well be a justification for including attributes from the adjacent relations in the target table. Inclusion of the make and model name from `Vehicle Make/Model' and the MOT date from `MOT Certification' could save many physical accesses.
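As a hedged sketch of the saving (the table and column names are illustrative assumptions), an enquiry which currently requires a three-table join could be satisfied from the vehicle table alone once the two attributes have been duplicated into it:

    -- fully normalised design: the enquiry visits three tables
    SELECT V.VEHICLE_REG_MARK, M.MAKE_NAME, M.MODEL_NAME, T.MOT_EXPIRY_DATE
    FROM   VEHICLE V, MAKE_MODEL M, MOT_CERT T
    WHERE  V.MAKE_MODEL_CODE  = M.MAKE_MODEL_CODE
    AND    T.VEHICLE_REG_MARK = V.VEHICLE_REG_MARK
    AND    V.VEHICLE_REG_MARK = 'A123BCD';

    -- after attribute consolidation: the same enquiry is satisfied from one table
    SELECT VEHICLE_REG_MARK, MAKE_NAME, MODEL_NAME, MOT_EXPIRY_DATE
    FROM   VEHICLE
    WHERE  VEHICLE_REG_MARK = 'A123BCD';

Each join leg that is removed is an access path - and therefore CPU and I/O - which the enquiry no longer has to pay for.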

The Consolidation Technique

The data model which is constructed during data analysis and logical database design will be schematically represented by a two dimensional diagram. Similarly, the ultimate database design can also be represented in two dimensional form - in what is normally referred to as a Bachman diagram. Thus, there must be a finite number of options which one can adopt to transform the data model into a physical database design.

At one extreme one can implement the data model more or less as it stands, with each relation being mapped one-to-one to a table and each relationship or foreign key linkage implemented as a physical linkage. This is said to offer maximum flexibility and minimal redundancy, but it incurs the cost of increased complexity and longer load and processing times - there is the need to incur additional file input/output operations and to maintain the linkages.

The combination or consolidation of two relations into one table provides the basis for performance improvements by retrieving both sets of data in one access operation. This avoids not only the extra input/output overhead that an access involves, but also the instruction overhead - each database read/write operation will consume several thousand instructions whenever it is executed.

By progressive consolidation one might combine the relations and their component attributes into a series of structures resembling conventional flat files. This simplifies and speeds processing but minimises flexibility and increases redundancy. In effect one is seeking a balance between these two extremes - basically speed of access versus flexibility.

Physical Database Design

Consolidation relies on regarding each relation on the chart in terms of its connections to other relations. The relation under consideration can be termed the `target' relation. Assuming it has one-to-many linkages with other relations, these can be termed its `owner' or `member' relations.

[Figure: PDB - the target relation and its owner and member relations]

Given that one wishes to consider consolidating the target relation with a related relation, there are basically the following options:

  • combine it or its attributes with its so called owner relation
  • combine it or its attributes with its so called member relation
  • do nothing and implement the linkages as they stand.

Redundancy

The Concept of Redundancy

The decision to consolidate two relations raises the possibility of:

  • consolidating the member into the owner and creating a relation with a fixed number of repeating columns, i.e. without redundancy, but which is denormalised and more difficult to process
  • consolidating the owner data into each of the members and thereby creating redundancy for the sake of performance and processing simplicity.

Redundancy as a concept is too often regarded with simple horror. However, the use of redundant data is a legitimate design aid whenever one can control the resultant redundancy. Specifically,

  • one must precisely plan and carefully control the various updating processes so that all versions of a piece of data are updated simultaneously (a sketch of such a synchronised update follows this list)
  • one must ensure that each item of data conforms to compatible formats
  • one must regularly audit for inconsistency the physical locations at which the versions exist.
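A minimal sketch of the first point, assuming (purely for illustration) that the make name has been duplicated from a MAKE_MODEL table into the VEHICLE table: both copies are changed within one unit of work so that neither can ever be seen out of step:

    -- the master copy and its duplicate are updated within a single unit of work
    UPDATE MAKE_MODEL
    SET    MAKE_NAME = 'ROVER GROUP'
    WHERE  MAKE_MODEL_CODE = 'ROV01';

    UPDATE VEHICLE
    SET    MAKE_NAME = 'ROVER GROUP'
    WHERE  MAKE_MODEL_CODE = 'ROV01';

    COMMIT;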

Types of Redundancy

One's attitude towards redundancy should also be conditioned by the type of redundancy which is to be created. It is worth distinguishing between data `repetition' and `duplication'. The difference can be demonstrated by the following instances.

In the case of data `repetition' a decision is taken to consolidate the attributes of the owner relation into a member relation - such as DISTRICT details into TEST-CTRE. Thus, if our data model looks like this and we have the following occurrences:

[Figure: Redundancy - the DISTRICT/TEST-CTRE data model and its occurrences]

We would create a structure which looked like this:

            District Code    Centre No    Centre Name
   row 1    99               1            BRENTWOOD
   row 2    99               2            CHELMSFORD
   row 3    99               3            COLCHESTER

The attribute values of the so-called `owner' relation are repeated in different logical records but in the same position, i.e. the first column. The latter remark is important for the distinction that is made between `repetition' and `duplication'.

Duplication is so named because the attribute values are held in different column positions in different rows. This situation is caused by combining or consolidating the attribute values and still retaining the original relation as a logical table.

Examples of this might be retaining the original relation DISTRICT as a table while consolidating the district code into TEST-CTRE, or combining the make/model name from the MAKE/MODEL table with the vehicle.

While this distinction is only a matter of degree, there are nevertheless important control implications in its introduction. To ensure that no `repeated' occurrences get out of step we are able to audit them on the basis of a simple sequential pass. However, the situation is obviously more difficult to control with `duplication' since a simple sequential pass cannot suffice. Special processes need to be specifically developed to examine the values in their separate respective locations (in possibly different formats) and updating and recovery processes can also be more complex.
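A sketch of such a purpose-built audit process for the make name duplicated into the vehicle table (using the same illustrative names assumed above): the query reports every duplicate which has drifted from its master value:

    -- audit for `duplication': list vehicle rows whose copy of the make name no longer matches the master
    SELECT V.VEHICLE_REG_MARK, V.MAKE_MODEL_CODE, V.MAKE_NAME, M.MAKE_NAME
    FROM   VEHICLE V, MAKE_MODEL M
    WHERE  V.MAKE_MODEL_CODE = M.MAKE_MODEL_CODE
    AND    V.MAKE_NAME <> M.MAKE_NAME;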

Limits to Flexibility

The consolidation technique also considers the limits to flexibility. The data analyst, in bowing down to the God of Normalisation, can invent relations which the end user might never need to consider in the lifetime of the system. In the data analysis exercise for the health insurance company, 20 of the 70 unimplementable relations were found to be addressing the need for data histories, e.g. analogous to the ownership history in the DVLC case. Therefore consolidation considers with the end user whether the target relation will be implemented now, at some future phase or never at all, and whether future flexibility considerations will limit current design options.

One often encounters relations such as region and sub-region on the logical data model, yet it might be unnecessary to create a physical database of region or sub-region records. Often only part of such a relation, such as its key, would appear, and then it might well occur merely as a secondary key.

One must additionally attempt to identify those parts of the structure which are liable to be unstable and volatile so that, in the future, one is able to add `novel' structures to meet such unspecified but anticipated requirements. This is done by normalising the conceptual data structure and by systematically questioning whether the relations in that structure should be classified as volatile.

Optimal Database Design - OD²

The concept of the achievement of `Optimal Database Design' is viewed on two levels. On a long term level it is viewed as a project with the objective of defining a range of alternative design criteria, the use of tools which can automatically evaluate the performance of the design against such criteria and the development and adoption of adequate trade off schemes to maximise a `utility function'. Mathematical modelling strategies for maximising such a `utility function' are sought.

In the short term it is viewed as a packaging of `tactics'. A series of activities is planned so that OD² can be `sold' as a comprehensive programme for the systematisation of `tactics' - existing but novel techniques and tools for promoting more optimal designs. These `tactics' are based on a practical database design approach which has been developed over 15 years, which has been successfully used on designs undertaken throughout Europe and which has been applied to Hierarchical, Codasyl and Relational databases.

This paper on the `tactic' of denormalisation is the first deliverable in this series. Other `tactics' consider the use of DB maintained referential integrity, computer aided database design (CADD) support, traffic analysis techniques and the quantification of defined design criteria. The entire set of `tactics' will eventually be published as an MXG type guide.

The incorporation of automated modelling support is regarded as essential to the productivity of such an undertaking. The OD² `Consolidation' tactic provides the systematic means for defining the degree of denormalisation which might be implemented and determining the range of permissible options. SYS structure extraction tools provide a means whereby an existing database can be subjected to such OD² tuning `tactics'. Optimizer Simulation, Performance Evaluation and Simulation and Analytical Modelling Tools provide the means whereby such options can be explored.

A formal physical database design approach will ensure that critical design criteria are not overlooked.

References

1. E. F. Codd (1971)

`Normalized Data Base Structure: A Brief Tutorial.' Proc 1971 ACM SIGFIDET Workshop on Data Description, Access and Control, San Diego, Calif. (November 1971)

2. E. F. Codd (1972)

`Further Normalization of the Data Base Relational Model' in `Data Base Systems', Courant Computer Science Symposia Series Vol 6, Englewood Cliffs, N.J.: Prentice Hall (1972)

3. E. F. Codd (1970)

`A Relational Model for Large Shared Data Banks' CACM 13 No 6 (June 1970)

4. C. J. Date (1990)

`An Introduction to Database Systems' Vol 1 Fifth Edition, System Programming Series, Addison Wesley (1990)

5. P. P. Chen (1976)

`The Entity Relationship Model - Towards a Unified View of Data' ACM TODS 1, No 1 (March 1976)

6. E. Downs, P. Clare, I. Coe (1992)

`Structured Systems Analysis and Design Method' 2nd Ed. Prentice-Hall

This paper was first presented at UKCMG 1993 - the annual conference of the United Kingdom Computer Measurement Group, held in Birmingham in May 1993.

Copyright © rests with Bernard W Bennetto,  Matthew C O Todd & Technology Translation Limited
