Basis for normalization of relations.

The theory of normalization is related to the relational data model set up by Codd in 1970. This model is based on the concept of relation.

The relational data model
  • Concept of relation

A relation is a 2 dimensional array, whose columns represent the properties of the relation (attributes or fields), and whose lines correspond to the various values (instances, occurrences) of these attributes. All the values of an attribute can belong only to one type: integer, real, character string..., and all the lines must be different, from where the concept of key of a relation.

  • Key of a relation

The key of a relation is the minimal subset of attributes that allows each row to be identified in a single way. All relations must obligatorily have a key.

  • Schema of a Relation

It corresponds to the "condensed" definition of the relation. It is composed of the name of the relation followed by the list of the attributes which make this relation, the key attributes being underlined. For example, the relation R consisting of the attributes A1, A2 and A3 whose schema is : R(A1,A2,A3).

The following table describes a possible "extension" of the relation:

A1 A2 A3
1 Wording 1 50
2 Wording 2 60
3 Wording 3 50
...    

The concept of relation is simple and intuitive, but leads to many anomalies if one applies it in a strict sense.

  • Lack of semantic connections between data.
  • Redundancy of data.
  • "Abnormal" behaviors of the relations in case of updates: modifications, deletions.

To illustrate the various concepts presented, we will use the following attributes that are relevant to educational management:

Mnemonic Wording Type
Coefficient Subject's coefficient. Integer
Nomat The name of the subject String(30)
Nomens The name of the teacher String(30)
Numat The subject's number Integer
Numens The teacher's number Integer

 

We will assume that a teacher teaches only one subject, but one subject can be taught by more than one teacher. There are 5 teachers for each subject and there are 10 subjects in the class. We define, with these attributes, the EDUCATION relation.

EDUCATION(Numens, Nomens, Numat, Nomat, Coefficient)

"Problems" with the relation EDUCATION

We first notice that this relation fully meets the basic definition and can therefore be implemented and used in a relational DBMS. The values of the attributes belong to only one type and the key of the relation (Numens) makes it possible to distinguish, in a unique way, each line.
However, this relation, without structuring the information, has numerous anomalies.

  • In terms of semantics.

The key in this relation is the number of the teacher (Numens). This means that this attribute identifies, characterizes, the relation and thus that all the other attributes of this relation are properties based semantically on this attribute. This implies, among others, that the coefficient of a subject, or his name, where the direct characteristics of a teacher, which is, obviously, erroneous.
This modelling problem is due to the fact that the EDUCATION relation does not represent an object, an entity, of the real world but a combination of several entities.

  • In terms of data redundancy.

In the EDUCATION relation, the characteristics of a subject (its name and its coefficient) should be repeated as many times as there are teachers in the subject. In our case, five times. This repetition, which can be avoided, will create the risk of partial updating of the data and therefore the possibility of inconsistency.

The size, in characters, necessary to store this relation is the following: 5*10*(2+30+2+30+2)= 3300

We considered an integer to be two bytes.

  • At the updates level.

-It is impossible to enter into the characteristics of a matter unless it is attached, at least, to a teacher. Indeed, this would mean that the key of the relation (Numens) would have value NULL which is prohibited by the constraint on the uniqueness of the key.

-If there is only one teacher in a subject, the departure of this teacher implies the suppression of all information about the teacher's relation, and in this case, the ones about the subject.

All these anomalies result from the fact that the EDUCATION relation is not in a "normal" form.

Normalization of the relations

Setting in a "normal" form of relations, or normalization, aims to remove all the abnormal behaviors described above, and is based on the concept of functional dependency.

  • Functional dependency (FD)

We say that there is a functional dependency between an attribute A1 and an attribute A2, one notes A1 - > A2, if knowing the value of A1 we can associate only one value of A2. We also say that A1 determines A2. A1 is the source of the functional dependency and A2 the goal.

In our example, we have the following FDs:

Numens -> Nomens, Numat, Nomat, Coefficient
Numat -> Nomat, Coefficient

  • Elementary functional dependency

Functional dependency is said to be elementary if the source does not include unnecessary attributes. The question of FD elementary aspect should therefore only arise when the left side of the FD has more than one attribute.

In our example, all the FDs are elementary. In addition, if we have the following FDs:

A, B -> C
A -> C

The functional dependence A, B - > C is not elementary since B is unnecessary.

  • Direct functional dependency

We say that a functional dependency A - > B is direct if there is no attribute C such we can have A - > C and C - > B. In other words, that means that the dependency between A and B cannot be obtained by transitivity.

In our example, the FD: Numat - > Nomat, Coefficient is direct. in addition, the FD Numens - > Nomat is not direct.

We have: Numens -> Numat and Numat -> Nomat.

The normal forms
  • First normal form(1NF)

A relation is said to be in 1NF if all of its component attributes cannot be subdivided.
The relation EDUCATION is in 1NF.

  • Second normal form (2NF)

We say that a relation is in 2NF if it is in 1NF and if all the FDs between the key and the other attributes are elementary.
The relation EDUCATION is in 2NF.

  • Third normal form (3NF)

We say that a relation is in 3NF if it is in 2NF and if all the FDs between the key and the other attributes are elementary and direct.
The relation EDUCATION is not in 3NF.

  • Boyce-Codd normal form (BCNF)

We say that a relation is in BCNF if it is in 3NF and if the only FDs which exist are those which connect the key to the other non-key attributes.
The relation EDUCATION is not in BCNF.

Relations in 3NF for our example

The problems of the EDUCATION relation arise because this relation is in 2NF. To eliminate the abnormal behavior, it is necessary to structure the attributes of this relation so as to divide it into several relation in 3NF.
To construct the relation in 3NF relative to our example, we start from the set of FDs:

Numens -> Nomens, Numat, Nomat, Coefficient
Numat -> Nomat, Coefficient

After removing the transitivity, we obtain:

Numens -> Nomens, Numat
Numat -> Nomat, Coefficient

Which gives us the following relations:

TEACHER(Numens, Nomens, Numat)
SUBJECT(Numat, Nomat, Coefficient)

These two relations are in BCNF.

For detailed information on the design process, refer, in these pages, to the column DB Design.

Benefits of relations in 3NF (BCFN)
  • In terms of semantics.

The 2 previous relations correspond to objects, entities, of the real world. The attributes, properties of each relation are direct characteristics of the key of the related relation.

  • In terms of data redundancy.

The characteristic information of a subject (his name and coefficient) will be present only once and not 5 times compared to relation EDUCATION.
Information redundancy is therefore minimized, with no loss of information, limiting the risk of inconsistency.

Additionally, the size of the data is:

Size of the TEACHER relation : 5*10*(2+30+2)= 1700
Size of the SUBJECT relation: 10*(2+30+2)= 340

The size of the database is 2040 characters, so a reduction of about 40% compared to the database EDUCATION (3300 characters), without loss of information.

  • At the updates level.

-You can enter the characteristics of a material even if it is not attached to a teacher. The capture is done only in the SUBJECT relation.

- If there is only one teacher in a subject, the departure of this teacher involves the deletion of information relating to this teacher (one line in the TEACHER relation), not those who relate to the subject.

The two relations in 3NF, TEACHER and SUBJECT, are therefore the "best" possible representation of the data in our example.

The construction of an "optimal" data structure consist in determining the relations in 3NF (BCNF), constructed from the whole set of the source data.

The database design process presented in these pages (DB Design), allows to obtain directly, relations in 3NF (BCNF).

------------------------------------
There are 4 other normal forms: 4 NF, 5 NF, DCNF, 6 NF that prevents certain types of redundancy.
Designing databases that will have to be used in a real environment (hundreds of concurrent accesses, tens of tables made up of thousands of rows…), very often implies to make "adjustments" between the ideal structure, a purely theoretical approach, and the structure that will be implemented.
Designing databases in 3NF (BCNF) is thus a perfectly acceptable compromise that allows to construct coherent databases that can be used in a real environment.
------------------------------------

Back to structuring of data