Database Normalization — How it Works
Intro
Redundant data may not be an issue for non-relational databases.
For relational databases, this is a pest that must be disposed of.
Data redundancy increases the difficulty of inserting, updating or deleting rows from a database consistently. Database normalization is a technique to organize data into multiple tables to eliminate data redundancy.
Systems that require specific information retrieval from a database benefit from the precision normalization provides. However, systems performing mass amounts of information retrieval may experience a decrease in access speed. The lack of data duplicity in normalized databases results in slower read times from the required use of table joins. The performance effects of database normalization will not be discussed further here, but it is worthwhile to mention that such ramifications should be considered when choosing to normalize a database. You can learn more on that here.
There are multiple synonyms utilized when discussing databases. Though terms used will remain consistent throughout this article, feel free to use the following as a reference when coming across additional sources.
Keys in a Database Table
There exists numerous types of keys within a database table.
C# Corner defines a key in this way:
“A key is a unique data item or group of columns that exclusively identifies a row in a table.” — C# Corner
There can be any number of keys within a database table. The options available to uniquely identify a table row is proportional to the amount of keys present. Many different mechanisms exist to specify a single row in a table within many keys; less mechanisms exist in a table with few keys. A table without any keys cannot specify any row of data uniquely, making it difficult to perform accurate read or write operations.
Consider the following PERSON
table:
In the PERSON
table, three columns exist: Person_ID
, First_Name
and Last_Name
. Note that only one of the columns can uniquely identify any individual table row. First_Name
and Last_Name
can’t uniquely identify a single person; neither can we identify a single person by combining First_Name
and Last_Name
into one key. Names are bound to overlap at some point — in a table of only three rows, two rows of data both have the same First_Name
and Last_Name
!
Only one column in the PERSON
table can act as a key: Person_ID
. This is because:
- There will never be two people with the same
Person_ID
- Each column value for each table row can be obtained using
Person_ID
Types of Database Keys
🔑 Super
- A column or set set of columns used to uniquely identify all columns of a row
- Are not always candidate keys
Super keys are the most general type of keys found in a table. Super keys can be created by:
- Using individual keys { Person_ID }
- Combining keys with other keys { Person_ID, SSN }
- Combing keys with columns that are not keys { Person_ID, First_Name }
Not all existing super keys will be useful in the database normalization process.
🔑 Candidate
- Also known as a minimal super key, indicating the minimal amount of columns that can be used to identify a single table row
- Are chosen from the list of super keys
- Are super keys with no repeated columns
We may have two super keys from the super key description above:
1. { Person_ID }
2. { Person_ID, First_Name }
Only { Person_ID }
is denotable as a candidate key. { Person_ID, First_Name }
is not a minimal super key and consequently it is not a candidate key as { Person_ID, First_Name }
unnecessarily contains the general column First_Name
in its key definition.
🔑 Primary
- Chosen from the list of candidate keys
- Used to uniquely identify any table row
Only one primary key can exist in a database table. Person_ID
is the only candidate key that exists in the PERSON
table above, making it the natural choice as primary key.
🔑 Composite
- Candidate key created by combining two or more columns from within a single table
- Are not always candidate keys
Functional Dependency
When one column in a database table depends upon another column, functional dependency exists. The types of keys present in each table can help identify dependencies that are either beneficial or detrimental to the normalization of a database.
There are two specific types of negative functional dependency discussed in this article:
1. Partial Dependency
A partial dependency exists when a non-primary column depends upon a single column that is a part of a composite primary key.
An example of this type of dependency can be found in the 2NF explanation below.
2. Transitive Dependency
A transitive dependency exists when a non-primary column depends on some other non-primary column. There exists an indirect relationship between columns of the same table.
A transitive dependency only exists within tables that consist of at least three columns. Just as the transitive property in mathematics can only exist with three or more entities, the transitive property in database tables can only exist with three or more columns.
An example of this type of dependency can be found in the 3NF explanation below.
“Good“ Dependency
A positive form of functional dependency can exist.
In the PERSON
table used earlier, Person_ID
is the primary key. Both columns First_Name
and Last_Name
are dependent upon the Person_ID
column for their values. We can say Person_ID
determines both First_Name
and Last_Name
. This is denoted by:
Person_ID
— -> First_Name
and
Person_ID
— -> Last_Name
These two dependencies are acceptable because both non-primary columns depend upon the primary key for their values.
Additionally, each column in the PERSON
table can be obtained by using the primary key, Person_ID
.
Database Normal Forms (1NF, 2NF & 3NF)
As mentioned above, data redundancy causes insertion, deletion and modification anomalies. Database normalization is a technique to organize data into multiple tables to eliminate data redundancy.
Database normalization consists of six levels of progression. Each level is referred to as a normal form, only able to be reached after obtaining the previous. Second Normal Form (2NF) can only be reached after achieving First Normal Form (1NF), Third Normal Form can only be reached after achieving both Second Normal Form and First Normal Form, and so on. This process continues until Sixth Normal Form (6NF).
Forms beyond 4NF are generally pursued solely for academic research. The scope of this article focuses briefly on the first three normal forms: 1NF, 2NF & 3NF.
First Normal Form (1NF)
Five rules exist for a database to conform to 1NF:
- Each column must have a unique name
- The order of table rows does not matter
- Each table row in a column follows the same data format
- Each column contains purely atomic values
- No column is subverted (
Name_1
,Name_2
, etc.)
In the initial PROGRAMMERs
table above, the Certifications
column contains values that are not atomic; there is more than one value for at least one of the rows of data.
- The first row contains two values:
Python
andGo
- The third row contains two values:
TypeScript
andC#
Degree_1
and Degree_2
also exist as subverted columns. If a programmer obtained a third degree using the current approach, a third column, Degree_3
, would have to be created. This increases the likelihood for data anomalies in the future.
The solution to both issues is to create two new tables:
CERTIFICATIONS
DEGREES
Any reference to a certification or degree may now be removed from the PROGRAMMERS
table.
A reference to an individual programmer may now be maintained by using Prog_Id
in both the CERTIFICATIONS
and DEGREES
table.
Second Normal Form (2NF)
There are two rules for a database to conform to 2NF:
- Already conforms to 1NF
- There exists no partial dependencies
The WORKOUT
table above conforms to 1NF. However, an issue exists with compliance to 2NF. One rule remains unfollowed: no partial dependencies allowed.
As previously stated, a partial dependency exists when a non-primary column depends on only part of a composite primary key, rather than the composite primary key in its entirety. In the WORKOUT
table above, Workout_Type
and Workout_Duration
both depend on the Workout_ID
column for its value. They do not depend on the { Athlete_ID, Workout_ID }
key for its value, indicating a partial dependency. Athlete_Name
also only depends upon Athlete_ID
, rather than the { Athlete_ID, Workout_ID }
key.
These partial dependency issues can be resolved by:
- Getting rid of the
Athlete_Name
column in theWORKOUT
table - Creating a new
ATHLETE
table - Creating
Athlete_ID
andAthlete_Name
columns in theATHLETE
table - Simplifying
Workout_Type
andWorkout_Duration
toType
andDuration
respectively in theWORKOUT
table
The dependencies from Workout_Type
and Workout_Duration
to Workout_ID
are maintained; the dependency from Athlete_Name
to Athlete_ID
is also maintained — Second Normal Form has been achieved.
Third Normal Form (3NF)
There are two rules for a database to conform to 3NF:
- Conforms to 1NF and 2NF
- There exists no transitive dependencies
In the MOVIE_INFO
table above, we have information relating to two main entities:
- A movie
- A director
For the sake of this example, let’s assume the director’s salary is consistent across every movie directed.
In the MOVIE_INFO
table, Director_ID
is the primary key. The column values of Director_Name
, Movie_Name
and Director_Salary
may each be obtained from the Director_ID
column.
The MOVIE_INFO
table above conforms to 2NF. However, an issue exists with compliance to 3NF.
In the MOVIE_INFO
table, Movie_Name
determines Director_Name
. Director_Name
determines the Director_Salary
. This means that Movie_Name
also determines Director_Salary
. This indicates a transitive dependency, as there is an indirect relationship between columns in the same table; a non-primary column is able to determine another non-primary column.
This transitive dependency issue can be resolved by:
- Getting rid of the
Movie_Name
column in theMOVIE_INFO
table - Changing the
MOVIE_INFO
table name toDIRECTOR
to keep movie and director entities in separate tables - Creating a new
MOVIE
table - Creating a
Director_ID
and Name column in theMOVIE
table
There is now no indirect relationship between Movie_Name
and Director_Salary
in the same database table. Our database conforms to 3NF.
How Database Normalization Makes Better Programmers
Through the normalization process, a database becomes organized and data redundancy is eliminated.
Normalization teaches:
- Data should be atomic
- Column names should not be subverted
- Partial dependencies should not exist
- Transitive dependencies should not exist
Database normalization accomplishes more than the act of increasing the quality of a database’s structure; by identifying key concepts behind normalization and applying it to programming, code quality improves dramatically.
Normalizing Structures
The results found when normalizing databases can be mirrored to results when normalizing structures within programs. Structures can include a number of constructs. In TypeScript, a prime example of a structure is an interface (More on TypeScript interfaces here).
A PROGRAMMERS
interface can be created to mimic the unnormalized PROGRAMMERS
table from the 1NF section above:
// Unnormalized structure analogous to a table within an
// unnormalized databaseinterface PERSON {
Prog_ID: number;
Name: String;
Certifications: String;
Degree_1: String;
Degree_2: String;
}
The issues that existed within the PROGRAMMERS
table are analogous to the issues existing in this PROGRAMMERS
interface:
Certifications
allows only a single value while indicating plurality, violating atomicity- The degree property is subverted as
Degree_1
andDegree_2
The solution to the unorganized structure of the PERSON
interface can be fixed by adhering to the rules of 1NF. We can eliminate the 1NF violation by creating a separate interface for both certifications and degrees. Creating separate interfaces for violating attributes of an object is the same as creating separate tables for violating columns of a table. The normalized structure now looks like this:
// Normalized to 1NFinterface PERSON {
Prog_ID: number;
Name: String;
Certifications: Certification[];
Degree: Degree[];
}interface Certification {
Prog_ID: number;
title: String;
}interface Degree {
Prog_ID: number;
title: String;
}
Normalizing Logic
A database table with a partial dependency means a general column only depends on part of a composite primary key for its value. This can be represented through programming logic by part of a function return value depending on only a part of the function arguments.
Consider the following COFFEE_SNOB
database table:
In the table above, notice that the Creamer
column only depends on the Age
column, and the Sugar
column only depends on the Snob_Level
column.
Now consider the following code, representing the same logic:
interface Coffee {
creamer: boolean;
sugar: boolean;
}interface CoffeeSnob {
age: number;
snobLevel: number; // 1-10
}function serveCoffeeSnob(snob: CoffeeSnob): Coffee {
let coffee: Coffee;
if (snob.age >= 60 && snob.snobLevel >= 8) {
coffee = { creamer: false, sugar: false };
} else if (snob.age >= 60 && snob.snobLevel < 8) {
coffee = { creamer: false, sugar: true };
} else if (snob.age < 60 && snob.snobLevel >= 8 ) {
coffee = { creamer: true, sugar: false };
} else {
coffee = { creamer: true, sugar: true };
}
return coffee;
}const snob: CoffeeSnob = { age: 60, snobLevel: 10 };
const coffee = serveCoffeeSnob(snob); // without creamer or sugar
Ugly, right?
In the coffee
function above, the creamer
property of the returned coffee
object is only dependent upon the age property of the snob function argument. Likewise, the sugar
property of the returned coffee object is only dependent upon the snobLevel
property of the snob
function argument.
In a database table, the solution to partial dependency is to give the columns causing the dependency their own table. A similar solution exists in the coffee
function; the assignment of both properties can be pulled out into their own expressions. These expressions are now located in the returned coffee
object.
The result is cleaner, shorter code. 11 lines have been reduced to 4.
function coffee(snob: CoffeeSnob): Coffee {
return {
creamer: snob.age < 60,
sugar: snob.snobLevel < 8
};
}const snob: CoffeeSnob = { age: 60, snobLevel: 10 };
const coffee = serveCoffeeSnob(snob); // without creamer or sugar
Conclusion
This article covers a wide range of material: types of keys within a database, the first three forms of database normalization and how the normalization process can result in higher-quality code.
For more information on the subject of database normalization, refer to the sources listed below.
Sources
(With helpful advice from Anthony Feldhake, Zachary Temple & Zechariah Schwerman)