Basis for normalization of relations - Applications of Databases to Humanities and Social Sciences

The theory of normalization of relations is linked to the relational data model created by Codd in 1970. This model is based on the concept of relation.

Table des matières

The relational data model

Concept of relation

A relation is a 2 dimensional array, whose columns represent the properties of the relation (attributes, fields), and whose lines correspond to the various values (instances, occurrences) of these attributes. All the values of an attribute can belong only to one data type: integer, real, character string..., and all the lines must be different. This last definition involves the notion of relation key.

Key of a relation

The key to a relation is the minimum subset of attributes that uniquely identifies each line. All relations must necessarily have a key.

Schema of a Relation

It corresponds to the "condensed" definition of the relation. It is composed of the name of the relation followed by the list of the attributes which make this relation, the key attributes being underlined. For example, the relation R consisting of the attributes A1, A2 and A3 and whose schema is: R(A1,A2,A3).

The following table describes a possible "extension" of the relation:

A1	A2	A3
1	Wording 1	50
2	Wording 2	60
3	Wording 3	50
...

The concept of relation is simple and intuitive, but leads to numerous abnormalities if applied in a strict sense.

Lack of semantic connections among data.
Redundancy in the data.
Abnormal behavior of relations in case of updates: modifications, deletions.

To illustrate the various concepts presented, the following educational management attributes will be used:

Mnemonic	Wording	Type
Coefficient	Subject's coefficient.	Integer
Nomat	The name of the subject	String(30)
Nomens	The name of the teacher	String(30)
Numat	The subject's number	Integer
Numens	The teacher's number	Integer

Rules
We will assume that a teacher teaches only one subject, but that a subject can be taught by several teachers. We have five teachers per subject and ten subjects in education. We define, with these attributes, the EDUCATION relation.

EDUCATION(Numens, Nomens, Numat, Nomat, Coefficient)

"Problems" with the relation EDUCATION

We first notice that this relation fully meets the basic definition and can therefore be implemented directly in a relational DBMS. Attribute values belong to only one data type and the relation key (Numens) makes it possible to distinguish, in a single way, each line. But this relation, without structuring the information, presents numerous anomalies.

In terms of semantics.

The key in this relation is the number of the teacher (Numens). This means that this attribute identifies, characterizes the relation and thus all other attributes of this relation are semantically based properties of this attribute. This implies, among others, that the coefficient of a subject, or its name, are direct characteristics of a teacher, which is, of course, false. This modelling problem is due to the fact that the EDUCATION relation does not represent an object, an entity, of the real world but a combination of several entities.

In terms of data redundancy.

In the EDUCATION relation, the characteristics of a subject (its name and its coefficient) should be repeated as many times as there are teachers in the subject. In our case, five times. This repetition, which can be avoided, will create the risk of partially updating the data and thus the possibility of inconsistency.

The size, in characters, necessary to store this relation is the following: 5*10*(2+30+2+30+2)= 3300

We considered an integer to be two bytes.

At the updates level.

-It is impossible to enter the characteristics of a matter unless it is attached, at least, to a teacher. Indeed, it would mean that the key of the relation (Numens) would have NULL value which is forbidden by the constraint on the uniqueness of the key.

-If there is only one teacher in a subject, the departure of this teacher implies to delete all information about the teacher's relation, and in this case, the ones about the subject.

All these anomalies result from the fact that the EDUCATION relation is not in a "normal" form.

Normalization of relations

Setting in a "normal" form of relations, or normalization, aims to remove all the abnormal behaviors described above, and is based on the concept of functional dependency.

Functional dependency (FD)

We say that there is a functional dependency between an attribute A1 and an attribute A2, one notes A1 - > A2, if knowing the value of A1 we can associate only one value of A2. We also say that A1 determines A2. A1 is the source of the functional dependency and A2 the goal.

In our example, we have the following FDs:

Numens -> Nomens, Numat, Nomat, Coefficient
Numat -> Nomat, Coefficient

Elementary functional dependency

Functional dependency is said to be elementary if the source does not include unnecessary attributes. The question of FD elementary aspect should therefore only arise when the left side of the FD has more than one attribute.

In our example, all the FDs are elementary. In addition, if we have the following FDs:

A, B -> C
A -> C

The functional dependence A, B - > C is not elementary since B is unnecessary.

Direct functional dependency

We say that a functional dependency A - > B is direct if there is no attribute C such we can have A - > C and C - > B. This means that the dependency between A and B cannot be obtained by transitivity.

In our example, the FD: Numat - > Nomat, Coefficient is direct. in addition, the FD Numens - > Nomat is not direct.

We have: Numens -> Numat and Numat -> Nomat.

The normal forms

First normal form(1NF)

A relation is said to be in 1NF if all of its attributes cannot be subdivided.
The relation EDUCATION is in 1NF.

Second normal form (2NF)

We say that a relation is in 2NF if it is in 1NF and if all the FDs between the key and the other attributes are elementary.
The relation EDUCATION is in 2NF.

Third normal form (3NF)

We say that a relation is in 3NF if it is in 2NF and if all the FDs between the key and the other attributes are direct.
The relation EDUCATION is not in 3NF.

Boyce-Codd normal form (BCNF)

We say that a relation is in BCNF if it is in 3NF and if the only FDs which exist are those which connect the key to the other non-key attributes.
The relation EDUCATION is not in BCNF.

Relations in 3NF for our example

The problems of the EDUCATION relation arise because this relation is in 2NF. To eliminate this abnormal behavior, it is necessary to structure the attributes of this relation in order to divide it into several relations in 3NF. To construct the relation in 3NF relative to our example, we start from the set of FDs:

Numens -> Nomens, Numat, Nomat, Coefficient
Numat -> Nomat, Coefficient

After removing the transitivity, we obtain:

Numens -> Nomens, Numat
Numat -> Nomat, Coefficient

Which gives us the following relations:

TEACHER(Numens, Nomens, Numat)
SUBJECT(Numat, Nomat, Coefficient)

These two relations are in BCNF.

For detailed information on the design process, refer, in these pages, to the column DB Design.

Benefits of relations in 3NF (BCFN)

In terms of semantics.

The 2 previous relations correspond to objects, entities, of the real world. The attributes, properties of each relation are direct characteristics of the key of the related relation.

In terms of data redundancy.

The characteristic information of a subject (his name and coefficient) will be present only once and not 5 times compared to relation EDUCATION.
Information redundancy is therefore minimized, with no loss of information, limiting the risk of inconsistency.

Additionally, the size of the data is:

Size of the TEACHER relation : 5*10*(2+30+2)= 1700
Size of the SUBJECT relation: 10*(2+30+2)= 340

The size of the database is 2040 characters, so a reduction of about 40% compared to the database EDUCATION (3300 characters), without loss of information.

At the updates level.

-You can enter the characteristics of a subject even if it is not attached to a teacher. The capture is done only in the SUBJECT relation.

- If there is only one teacher in a subject, the departure of this teacher involves the deletion of information relating to this teacher (one line in the TEACHER relation), not those who relate to the subject.

The two relations in 3NF, TEACHER and SUBJECT, are therefore the "best" possible representation of the data in our example.

The construction of an "optimal" data structure consist in determining the relations in 3NF (BCNF), constructed from the whole set of the source data.

The database design process presented in these pages (DB Design), allows to obtain directly, relations in 3NF (BCNF).

------------------------------------
There are 4 other normal forms: 4 NF, 5 NF, DCNF, 6 NF that prevents some kinds of redundancy.
Designing databases that will have to be used in a real environment (hundreds of concurrent accesses, tens of tables made up of thousands of rows…), very often implies to make "adjustments" between the ideal structure, a purely theoretical approach, and the structure that will be implemented.
Designing databases in 3NF (BCNF) is thus a perfectly acceptable compromise that allows to construct coherent databases that can be used in a real environment.
------------------------------------

Back to structuring of data