Database Normalization — How it Works

Chase Grainger
12 min readMar 21, 2021
Photo by Fikri Rasyid on Unsplash

Intro

Redundant data may not be an issue for non-relational databases.

For relational databases, this is a pest that must be disposed of.

Data redundancy increases the difficulty of inserting, updating or deleting rows from a database consistently. Database normalization is a technique to organize data into multiple tables to eliminate data redundancy.

Systems that require specific information retrieval from a database benefit from the precision normalization provides. However, systems performing mass amounts of information retrieval may experience a decrease in access speed. The lack of data duplicity in normalized databases results in slower read times from the required use of table joins. The performance effects of database normalization will not be discussed further here, but it is worthwhile to mention that such ramifications should be considered when choosing to normalize a database. You can learn more on that here.

There are multiple synonyms utilized when discussing databases. Though terms used will remain consistent throughout this article, feel free to use the following as a reference when coming across additional sources.

Database synonyms found throughout other discussions related to database normalization
Examples of database synonyms

Keys in a Database Table

There exists numerous types of keys within a database table.

C# Corner defines a key in this way:

“A key is a unique data item or group of columns that exclusively identifies a row in a table.” — C# Corner

There can be any number of keys within a database table. The options available to uniquely identify a table row is proportional to the amount of keys present. Many different mechanisms exist to specify a single row in a table within many keys; less mechanisms exist in a table with few keys. A table without any keys cannot specify any row of data uniquely, making it difficult to perform accurate read or write operations.

Consider the following PERSON table:

PERSON database for simple explanations
Simple PERSON database table

In the PERSON table, three columns exist: Person_ID, First_Name and Last_Name. Note that only one of the columns can uniquely identify any individual table row. First_Name and Last_Name can’t uniquely identify a single person; neither can we identify a single person by combining First_Name and Last_Name into one key. Names are bound to overlap at some point — in a table of only three rows, two rows of data both have the same First_Name and Last_Name!

Only one column in the PERSON table can act as a key: Person_ID. This is because:

  1. There will never be two people with the same Person_ID
  2. Each column value for each table row can be obtained using Person_ID

Types of Database Keys

🔑 Super
  • A column or set set of columns used to uniquely identify all columns of a row
  • Are not always candidate keys

Super keys are the most general type of keys found in a table. Super keys can be created by:

- Using individual keys { Person_ID }

- Combining keys with other keys { Person_ID, SSN }

- Combing keys with columns that are not keys { Person_ID, First_Name }

Not all existing super keys will be useful in the database normalization process.

🔑 Candidate
  • Also known as a minimal super key, indicating the minimal amount of columns that can be used to identify a single table row
  • Are chosen from the list of super keys
  • Are super keys with no repeated columns

We may have two super keys from the super key description above:

1. { Person_ID }

2. { Person_ID, First_Name }

Only { Person_ID } is denotable as a candidate key. { Person_ID, First_Name } is not a minimal super key and consequently it is not a candidate key as { Person_ID, First_Name } unnecessarily contains the general column First_Name in its key definition.

🔑 Primary
  • Chosen from the list of candidate keys
  • Used to uniquely identify any table row

Only one primary key can exist in a database table. Person_ID is the only candidate key that exists in the PERSON table above, making it the natural choice as primary key.

🔑 Composite
  • Candidate key created by combining two or more columns from within a single table
  • Are not always candidate keys

Functional Dependency

When one column in a database table depends upon another column, functional dependency exists. The types of keys present in each table can help identify dependencies that are either beneficial or detrimental to the normalization of a database.

There are two specific types of negative functional dependency discussed in this article:

1. Partial Dependency

A partial dependency exists when a non-primary column depends upon a single column that is a part of a composite primary key.

An example of this type of dependency can be found in the 2NF explanation below.

2. Transitive Dependency

A transitive dependency exists when a non-primary column depends on some other non-primary column. There exists an indirect relationship between columns of the same table.

A transitive dependency only exists within tables that consist of at least three columns. Just as the transitive property in mathematics can only exist with three or more entities, the transitive property in database tables can only exist with three or more columns.

An example of this type of dependency can be found in the 3NF explanation below.

“Good“ Dependency

A positive form of functional dependency can exist.

In the PERSON table used earlier, Person_ID is the primary key. Both columns First_Name and Last_Name are dependent upon the Person_ID column for their values. We can say Person_ID determines both First_Name and Last_Name. This is denoted by:

Person_ID — -> First_Name

and

Person_ID — -> Last_Name

These two dependencies are acceptable because both non-primary columns depend upon the primary key for their values.

Additionally, each column in the PERSON table can be obtained by using the primary key, Person_ID.

Database Normal Forms (1NF, 2NF & 3NF)

As mentioned above, data redundancy causes insertion, deletion and modification anomalies. Database normalization is a technique to organize data into multiple tables to eliminate data redundancy.

Database normalization consists of six levels of progression. Each level is referred to as a normal form, only able to be reached after obtaining the previous. Second Normal Form (2NF) can only be reached after achieving First Normal Form (1NF), Third Normal Form can only be reached after achieving both Second Normal Form and First Normal Form, and so on. This process continues until Sixth Normal Form (6NF).

Forms beyond 4NF are generally pursued solely for academic research. The scope of this article focuses briefly on the first three normal forms: 1NF, 2NF & 3NF.

First Normal Form (1NF)

Five rules exist for a database to conform to 1NF:

  1. Each column must have a unique name
  2. The order of table rows does not matter
  3. Each table row in a column follows the same data format
  4. Each column contains purely atomic values
  5. No column is subverted (Name_1, Name_2, etc.)
1NF violation to 1NF violation
1NF Violation to 1NF Conformance

In the initial PROGRAMMERs table above, the Certifications column contains values that are not atomic; there is more than one value for at least one of the rows of data.

  • The first row contains two values: Python and Go
  • The third row contains two values: TypeScript and C#

Degree_1 and Degree_2 also exist as subverted columns. If a programmer obtained a third degree using the current approach, a third column, Degree_3, would have to be created. This increases the likelihood for data anomalies in the future.

The solution to both issues is to create two new tables:

  1. CERTIFICATIONS
  2. DEGREES

Any reference to a certification or degree may now be removed from the PROGRAMMERS table.

A reference to an individual programmer may now be maintained by using Prog_Id in both the CERTIFICATIONS and DEGREES table.

Second Normal Form (2NF)

There are two rules for a database to conform to 2NF:

  1. Already conforms to 1NF
  2. There exists no partial dependencies
Example of 2NF violation
2NF Violation Example

The WORKOUT table above conforms to 1NF. However, an issue exists with compliance to 2NF. One rule remains unfollowed: no partial dependencies allowed.

As previously stated, a partial dependency exists when a non-primary column depends on only part of a composite primary key, rather than the composite primary key in its entirety. In the WORKOUT table above, Workout_Type and Workout_Duration both depend on the Workout_ID column for its value. They do not depend on the { Athlete_ID, Workout_ID } key for its value, indicating a partial dependency. Athlete_Name also only depends upon Athlete_ID, rather than the { Athlete_ID, Workout_ID } key.

These partial dependency issues can be resolved by:

  1. Getting rid of the Athlete_Name column in the WORKOUT table
  2. Creating a new ATHLETE table
  3. Creating Athlete_ID and Athlete_Name columns in the ATHLETE table
  4. Simplifying Workout_Type and Workout_Duration to Type and Duration respectively in the WORKOUT table
Example of 2NF conformance
2NF Conformance Example

The dependencies from Workout_Type and Workout_Duration to Workout_ID are maintained; the dependency from Athlete_Name to Athlete_ID is also maintained — Second Normal Form has been achieved.

Third Normal Form (3NF)

There are two rules for a database to conform to 3NF:

  1. Conforms to 1NF and 2NF
  2. There exists no transitive dependencies
3NF Violation Example

In the MOVIE_INFO table above, we have information relating to two main entities:

  1. A movie
  2. A director

For the sake of this example, let’s assume the director’s salary is consistent across every movie directed.

In the MOVIE_INFO table, Director_ID is the primary key. The column values of Director_Name, Movie_Name and Director_Salary may each be obtained from the Director_ID column.

The MOVIE_INFO table above conforms to 2NF. However, an issue exists with compliance to 3NF.

In the MOVIE_INFO table, Movie_Name determines Director_Name. Director_Name determines the Director_Salary. This means that Movie_Name also determines Director_Salary. This indicates a transitive dependency, as there is an indirect relationship between columns in the same table; a non-primary column is able to determine another non-primary column.

This transitive dependency issue can be resolved by:

  1. Getting rid of the Movie_Name column in the MOVIE_INFO table
  2. Changing the MOVIE_INFO table name to DIRECTOR to keep movie and director entities in separate tables
  3. Creating a new MOVIE table
  4. Creating a Director_ID and Name column in the MOVIE table
3NF Conformance example
3NF Conformance Example

There is now no indirect relationship between Movie_Name and Director_Salary in the same database table. Our database conforms to 3NF.

How Database Normalization Makes Better Programmers

Through the normalization process, a database becomes organized and data redundancy is eliminated.

Normalization teaches:

  • Data should be atomic
  • Column names should not be subverted
  • Partial dependencies should not exist
  • Transitive dependencies should not exist

Database normalization accomplishes more than the act of increasing the quality of a database’s structure; by identifying key concepts behind normalization and applying it to programming, code quality improves dramatically.

Normalizing Structures

The results found when normalizing databases can be mirrored to results when normalizing structures within programs. Structures can include a number of constructs. In TypeScript, a prime example of a structure is an interface (More on TypeScript interfaces here).

A PROGRAMMERS interface can be created to mimic the unnormalized PROGRAMMERS table from the 1NF section above:

// Unnormalized structure analogous to a table within an
// unnormalized database
interface PERSON {
Prog_ID: number;
Name: String;
Certifications: String;
Degree_1: String;
Degree_2: String;
}

The issues that existed within the PROGRAMMERS table are analogous to the issues existing in this PROGRAMMERS interface:

  • Certifications allows only a single value while indicating plurality, violating atomicity
  • The degree property is subverted as Degree_1 and Degree_2

The solution to the unorganized structure of the PERSON interface can be fixed by adhering to the rules of 1NF. We can eliminate the 1NF violation by creating a separate interface for both certifications and degrees. Creating separate interfaces for violating attributes of an object is the same as creating separate tables for violating columns of a table. The normalized structure now looks like this:

// Normalized to 1NFinterface PERSON {
Prog_ID: number;
Name: String;
Certifications: Certification[];
Degree: Degree[];
}
interface Certification {
Prog_ID: number;
title: String;
}
interface Degree {
Prog_ID: number;
title: String;
}

Normalizing Logic

A database table with a partial dependency means a general column only depends on part of a composite primary key for its value. This can be represented through programming logic by part of a function return value depending on only a part of the function arguments.

Consider the following COFFEE_SNOB database table:

Table to represent normalizing logic
Logic normalization table representation

In the table above, notice that the Creamer column only depends on the Age column, and the Sugar column only depends on the Snob_Level column.

Now consider the following code, representing the same logic:

interface Coffee {
creamer: boolean;
sugar: boolean;
}
interface CoffeeSnob {
age: number;
snobLevel: number; // 1-10
}
function serveCoffeeSnob(snob: CoffeeSnob): Coffee {
let coffee: Coffee;
if (snob.age >= 60 && snob.snobLevel >= 8) {
coffee = { creamer: false, sugar: false };
} else if (snob.age >= 60 && snob.snobLevel < 8) {
coffee = { creamer: false, sugar: true };
} else if (snob.age < 60 && snob.snobLevel >= 8 ) {
coffee = { creamer: true, sugar: false };
} else {
coffee = { creamer: true, sugar: true };
}
return coffee;
}
const snob: CoffeeSnob = { age: 60, snobLevel: 10 };
const coffee = serveCoffeeSnob(snob); // without creamer or sugar

Ugly, right?

In the coffee function above, the creamer property of the returned coffee object is only dependent upon the age property of the snob function argument. Likewise, the sugar property of the returned coffee object is only dependent upon the snobLevel property of the snob function argument.

In a database table, the solution to partial dependency is to give the columns causing the dependency their own table. A similar solution exists in the coffee function; the assignment of both properties can be pulled out into their own expressions. These expressions are now located in the returned coffee object.

The result is cleaner, shorter code. 11 lines have been reduced to 4.

function coffee(snob: CoffeeSnob): Coffee {
return {
creamer: snob.age < 60,
sugar: snob.snobLevel < 8
};
}
const snob: CoffeeSnob = { age: 60, snobLevel: 10 };
const coffee = serveCoffeeSnob(snob); // without creamer or sugar

Conclusion

This article covers a wide range of material: types of keys within a database, the first three forms of database normalization and how the normalization process can result in higher-quality code.

For more information on the subject of database normalization, refer to the sources listed below.

Sources

(With helpful advice from Anthony Feldhake, Zachary Temple & Zechariah Schwerman)

--

--