Incorrect modelling issues. - Applications of Databases to Humanities and Social Sciences

When using an RDBMS, poor data modelling can have a significant impact on the relevance of the results produced. We will focus here on two problems that may arise when the modelling is not rigorous enough:

1-Questions about a scheme that is not in 3 NF

Relational database queries are carried out primarily using the SQL language. This language implements 2 fundamental operators:

Projection: allows you to select the columns of the relations
Join: allows you to join the columns of relations

The projection can be considered as a cutting (decomposition) operator of relations, and join as a recomposition (pasting) operator of relations. From a pictorial point of view, let us consider a piece of paper as a relation. The cutting of this sheet into vertical strips (projection) and then the gluing (joining) of the cut strips must allow, logically, to reconstruct the sheet of paper (the relation) in its initial state.

This seemingly obvious result is not obvious if the relations which make up the database are not in a proper structure.

For example, consider the following document that aggregates information about customers.

CUSTOMERS	Numcli	Nomcli	Address	Dept	Lbdept	Région
	1	Dupont	11 rue du Bois	33	Gironde	Aquitaine
	2	Martin	14 avenue Leclerc	33	Gironde	Aquitaine

Let us suppose that the modelling of this document creates the following two relations:

R1(Numcli, Nomcli, Dept)
R2(Dept, Lbdept, Région, Address)

R1 is the projection of CUSTOMERS on the attributes Numcli, Nomcli, Dept
R2 is the projection of CUSTOMERS on the attributes Dept, Lbdept, Region, Address

R1	Numcli	Nomcli	Dept	R2	Dept	Lbdept	Région	Address
	1	Dupont	33		33	Gironde	Aquitaine	11 rue du Bois
	2	Martin	33		33	Gironde	Aquitaine	14 avenue Leclerc

If we make a join of R1 and R2 on the attribute they have in common (Dept), we should logically find the CUSTOMERS relation.
However, the result of the join is as follows:

R1 x R2	Numcli	Nomcli	Dept	Lbdept	Région	Address
	1	Dupont	33	Gironde	Aquitaine	11 rue du Bois
	1	Dupont	33	Gironde	Aquitaine	14 avenue Leclerc
	2	Martin	33	Gironde	Aquitaine	11 rue du Bois
	2	Martin	33	Gironde	Aquitaine	14 avenue Leclerc

This means that if R1 and R2 relations were implemented as is in an RDBMS, the SQL query join would give results unrelated to reality…

This abnormal behavior (decomposition with lost of information) is due to the fact that R2 is not in 3 NF. In fact, the elementary and direct FDs (see: normalization of relations) that link the various attributes are as follows:

Numcli -> Nomcli, Address, Dept
Dept -> Lbdept, Région

It follows that we have the following relations in 3 FNs:

CUSTOMER(Numcli, Nomcli, Address, Dept)
DEPARTEMENT(Dept, Lbdept, Région)

It will be easy to verify that the join of CUSTOMER and DEPARTEMENT restores the original CUSTOMERS relation. The decomposition of CUSTOMERS into 2 relations CUSTOMER and DEPARTMENT is therefore done without loss of information.

Improper modeling (relations that are not in 3 NFs) may therefore have a significant impact on the relevance of the results produced.

2-Arrays modeling

Many people are often confused by the database approach. They do not find a structure allowing the direct implementation of an array of data already filled as it is possible in specialized software such as SAS, Minitab, Excel or other quantitative tools.

The modeling of an array in a database is carried out by the implementation of a relation (table) whose key is constituted by all the indexes of the array and the property is the value associated with each combination of the indexes.
If we consider an array A, consisting of i indexes (i1, i2…), with for each combination of indexes the value V, the relation in 3 NF that allows modeling A is the following:

A(i1, i2,…, V)

This structure relates each combination of i indexes to its corresponding value. For instance, for a contingency array with 10 rows and 5 columns, the database structure will consist of 50 (10*5) rows.

Although RDBMSs are not designed to process data arrays, many RDBMSs (Oracle, MySQL, Firebird...) have added features to operate with arrays. Data arrays can then be defined directly in the RDBMS.
However, the array data query cannot be done directly with SQL and has to be done using a programming language, the "stored procedures" in the case of Firebird.

Example
Let us take the example of managing the employees of a company. For every employee, we want the ability to store information like:

His employee number, his name, his date of birth, as well as the languages, other than his natural language, he knows (from 1 to 5) and the level of mastery he has of each language: beginner, confirmed, expert.

1-Working with arrays
The Language and Level attributes will be described as arrays which may contain 1 to 5 elements. Using, for instance, the formalism of the Firebird RDBMS, we would have a relation (table) of the following form:

EMPLOYEE(Numemp, Nomemp, Dtnaiss, Language[1..5], Level[1..5])

Remarks
-Even if the employee does not know languages other than his natural language, the attributes Language and Level will be reserved 5 times.
-If an employee speaks more than 5 languages, the information cannot be saved.
-To display the languages, and the language level of each employee it is necessary to use a programming language that will iterate on the possible values of the Language and Level attributes.
To process this last point, you need to use, for example with Firebird, a stored procedure like this:

possible, more often, toDECLARE VARIABLE i INTEGER;
BEGIN
i = 1;
WHILE (i <= 5) DO
BEGIN
SELECT langue[:i], Niveau[:i] FROM EMPLOYE
WHERE (Numemp=:code)
INTO :languages, :niveaux;
IF (languages = ' ') THEN languages = 'NULL';
i = i +1;
SUSPEND;
END
END

2-Modeling without array
In this example, it is easy to avoid the use of arrays. Using arrays is an incorrect modelling. To manage this example, it is enough to define the following 2 relations (tables) in 3 NFs:

EMPLOYEE(Numemp, Nomemp, Dtnaiss)
LANGUAGES(Numemp,Language, Level)

Remarks
With this modeling:
-If the employee does not know languages other than his natural language, no rows will appear in the LANGUAGES table for the employee.
-If an employee is fluent in more than 5 languages, the information can be recorded easily.
-To display the languages and language level of each employee, all you have to do is make an SQL query that performs a join between EMPLOYEE and LANGUAGES on the Numemp attribute, to have the desired information.

Conclusion
The use of arrays, although possible with a RDBMS, must be avoided. Proper modeling makes it possible, most often, to get rid of the use of arrays.