Thursday, April 16, 2009

Database Normalization

What is Normalization?


Normalization is the process of efficiently organizing data in a database.

There are two goals of the normalization process:
    • eliminating redundant data (for example, storing the same data in more than one table)
    • ensuring data dependencies make sense (only storing related data in a table).

Both of these are worthy goals as they reduce the amount of space a database consumes and ensure that data is logically stored.

The concept of database normalization is not unique to any particular Relational
Database Management System. It can be applied to any of several implications
of relational databases including Microsoft Access, dBase, Oracle, etc. The
benefits of Normalizing your database include:


  • Avoiding repetitive entries
  • Reducing required storage space
  • Preventing the need to restructure existing tables to accommodate new data.
  • Increased speed and flexibility of queries, sorts, and summaries.


The Normal Forms


The database community has developed a series of guidelines for ensuring that databases are normalized. These are referred to as normal forms and are numbered from one (the lowest form of normalization, referred to as first normal form or 1NF) through five (fifth normal form or 5NF).

In practical applications, you'll often see 1NF, 2NF, and 3NF along with the occasional 4NF. Fifth normal form is very rarely seen

  • In order to reach peak efficiency, it is recommended that relational databases be normalized through at least the third normal form.
  • In order to normalize a database, each table should have a primary key field that uniquely identifies each record in that table. A primary key can consist of a single field (an ID Number field for instance) or a combination of two or more fields that together make a unique key (called a multiple field primary key).



First Normal Form (1NF)


First normal form (1NF) sets the very basic rules for an organized database:

  • Eliminate duplicative columns from the same table.
  • Create separate tables for each group of related data and identify each row with a unique column or set of columns (the primary key).

For a table to be in first normal form, data must be broken up into the smallest units possible.

For example, the following table is not in first normal form.


Name Address Phone
Sally Singer 123 Broadway New York, NY, 11234 (111) 222-3345
Jason Jumper 456 Jolly Jumper St. Trenton NJ, 11547 (222) 334-5566


To conform to first normal form, this table would require additional fields.

    • The name field should be divided into first and last name and the address should

be divided by street, city state, and zip like this.


ID First Last Street City State Zip Phone
564 Sally Singer 123 Broadway New York NY 11234 (111) 222-3345
565 Jason Jumper 456 Jolly Jumper St. Trenton NJ 11547 (222) 334-5566


    • In addition to breaking data up into the smallest meaningful values, tables
in first normal form should not contain repetitions groups of fields such as
in the following table.


Rep ID Representative Client 1 Time 1 Client 2 Time 2 Client 3 Time 3
TS-89 Gilroy Gladstone US Corp. 14 hrs Taggarts 26 hrs Kilroy Inc. 9 hrs
RK-56 Mary Mayhem Italiana 67 hrs Linkers 2 hrs


The problem here is that each representative can have multiple clients not all will have three. Some may have less as is the case in the second record, tying up storage space in your database that is not being used, and some may have more, in which case there are not enough fields. The solution to this is to add a record for each new piece of information.


Rep ID Rep First Name Rep Last Name Client Time With Client
TS-89 Gilroy Gladstone US Corp 14 hrs
TS-89 Gilroy Gladstone Taggarts 26 hrs
TS-89 Gilroy Gladstone Kilroy Inc. 9 hrs
RK-56 Mary Mayhem Italiana 67 hrs
RK-56 Mary Mayhem Linkers 2 hrs

*Notice the splitting of the first and last name fields again.



This table is now in first normal form. Note that by avoiding repeating groups of fields, we have created a new problem in that there are identical values in the primary key field, violating the rules of the primary key. In order to remedy this, we need to have some other way of identifying each record. This can be done with the creation of a new key called client ID.


Rep ID* Rep First Name Rep Last Name Client ID* Client Time With Client
TS-89 Gilroy Gladstone 978 US Corp 14 hrs
TS-89 Gilroy Gladstone 665 Taggarts 26 hrs
TS-89 Gilroy Gladstone 782 Kilroy Inc. 9 hrs
RK-56 Mary Mayhem 221 Italiana 67 hrs
RK-56 Mary Mayhem 982 Linkers 2 hrs


This new field can now be used in conjunction with the Rep ID field to create a multiple field primary key. This will prevent confusion if ever more than one Representative were to serve a single client.



Second Normal Form (2NF)


Second normal form (2NF) further addresses the concept of removing duplication data:
  • Meet all the requirements of the first normal form.
  • Remove subsets of data that apply to multiple rows of a table and place them in separate tables.
  • Create relationships between these new tables and their predecessors through the use of foreign keys.

The second normal form applies only to tables with multiple field primarykeys. Take the following table for example.


Rep ID* Rep First Name Rep Last Name Client ID* Client Time With Client
TS-89 Gilroy Gladstone 978 US Corp 14 hrs
TS-89 Gilroy Gladstone 665 Taggarts 26 hrs
TS-89 Gilroy Gladstone 782 Kilroy Inc. 9 hrs
RK-56 Mary Mayhem 221 Italiana 67 hrs
RK-56 Mary Mayhem 982 Linkers 2 hrs
RK-56 Mary Mayhem 665 Taggarts 4 hrs

This table is already in first normal form. It has a primary keyconsisting of Rep ID and Client ID since neither alone can be considered a
unique value.


  • The second normal form states that each field in a multiple field primary key table must be directly related to the entire primary key. Or in other words,each non-key field should be a fact about all the fields in the primary key.

  • Only fields that are absolutely necessary should show up in our table, all other fields should reside in different tables.

  • In order to find out which fields are necessary we should ask a few questions of our database.
In our preceding example, I should ask the question "What information is this table meant to store?" Currently, the answer is not obvious. It may bemeant to store information about individual clients, or it could be

holding data for employees time cards.

As a further example, if my database is going to contain records of employees I may want a table of
demographics and a table for payroll. The demographics will have all the employees personal information and will assign them an ID number. I should not have to enter the data twice, the payroll table on the other hand should refer to each employee only by their ID number. I can then link the two tables by a relationship and will then have access to all the necessary data.



  • In the table of the preceding example we are devoting three field to the identification of the employee and two to the identification of the client. I could identify them with only one field each -- the primary

key. I can then take out the extraneous fields and put them in their own
table. For example, my database would then look like the following.


Rep ID* Client ID* Time With Client
TS-89 978 14 hrs
TS-89 665 26 hrs
TS-89 782 9 hrs
RK-56 221 67 hrs
RK-56 982 2 hrs
RK-56 665 4 hrs

The above table contains time card information.


Rep ID* First Name Last Name
TS-89 Gilroy Gladstone
RK-56 Mary Mayhem

The above table contains Employee Information.


Client ID* Client Name
978 US Corp
665 Taggarts
782 Kilroy Inc.
221 Italiana
982 Linkers

The above table contains Client Information


These tables are now in normal form. By splitting off the unnecessary information and putting it in its own tables, we have eliminated redundancy and put our first table in second normal form. These tables are now ready to be linked through relationship to each other.



Third Normal Form (3NF)

There are two basic requirements for a database to be in third normal form:

  • Already meet the requirements of both 1NF and 2NF
  • Remove columns that are not fully dependent upon the primary key.


Imagine that we have a table of widget orders that contains the following attributes:

  • Order Number
  • Customer Number
  • Unit Price
  • Quantity
  • Total

Order NumberCustomer NumberUnit PriceQuantityTotal







    • Remember, our first requirement is that the table must satisfy the requirements of 1NF and 2NF.
      • Are there any duplicative columns? No.
      • Do we have a primary key? Yes, the order number.
Therefore, we satisfy the requirements of 1NF.
      • Are there any subsets of data that apply to multiple rows? No, so we also satisfy the requirements of 2NF.

      • Now, are all of the columns fully dependent upon the primary key?
The customer number varies with the order number and it doesn't appear to depend upon any of the other fields.

What about the unit price? This field could be dependent upon the customer number in a situation where we charged each customer a set price. However, looking at the data above, it appears we sometimes charge the same customer different prices. Therefore, the unit price is fully dependent upon the order number. The quantity of items also varies from order to order, so we're OK there.

What about the total? It looks like we might be in trouble here. The total can be derived by multiplying the unit price by the quantity, therefore it's not fully dependent upon the primary key. We must remove it from the table to comply with the third normal form. Perhaps we use the following attributes:

  • Order Number
  • Customer Number
  • Unit Price
  • Quantity

Now our table is in 3NF. But, you might ask, what about the total? This is a derived field and it's best not to store it in the database at all. We can simply compute it "on the fly" when performing database queries.

Fourth Normal Form (4NF)


Finally, fourth normal form (4NF) has one additional requirement:
  • Meet all the requirements of the third normal form.
  • A relation is in 4NF if it has no multi-valued dependencies.
Remember, these normalization guidelines are cumulative. For a database to be in 2NF, it must first fulfill all the criteria of a 1NF database.

  • In a many-to-many relationship, independent entities can not be stored in the same table.
Since it only applies to the many-to-many relationship, most developers can rightfully ignore this rule. But it does come in handy in certain situations, such as this one.

Consider the following example:



Pizza Delivery Permutations
Restaurant Pizza Variety Delivery Area
A1 Pizza Thick Crust Springfield
A1 Pizza Thick Crust Shelbyville
A1 Pizza Thick Crust Capital City
A1 Pizza Stuffed Crust Springfield
A1 Pizza Stuffed Crust Shelbyville
A1 Pizza Stuffed Crust Capital City
Elite Pizza Thin Crust Capital City
Elite Pizza Stuffed Crust Capital City
Vincenzo's Pizza Thick Crust Springfield
Vincenzo's Pizza Thick Crust Shelbyville
Vincenzo's Pizza Thin Crust Springfield
Vincenzo's Pizza Thin Crust Shelbyville


  • Each row indicates that a given restaurant can deliver a given variety of pizza to a given area.
  • The table has no non-key attributes because its only key is {Restaurant, Pizza Variety, Delivery Area}. Therefore it meets all normal forms up to BCNF.
  • It does not, however, meet 4NF. The problem is that the table features two non-trivial multivalued dependencies on the {Restaurant} attribute (which is not a superkey). The dependencies are:

  • {Restaurant} →→ {Pizza Variety}
  • {Restaurant} →→ {Delivery Area}


  • These non-trivial multivalued dependencies on a non-superkey reflect the fact that the varieties of pizza a restaurant offers are independent from the areas to which the restaurant delivers. This state of affairs leads to redundancy in the table: for example, we are told three times that A1 Pizza offers Stuffed Crust, and if A1 Pizza start producing Cheese Crust pizzas then we will need to add multiple rows, one for each of A1 Pizza's delivery areas. There is, moreover, nothing to prevent us from doing this incorrectly: we might add Cheese Crust rows for all but one of A1 Pizza's delivery areas, thereby failing to respect the multivalued dependency {Restaurant} →→ {Pizza Variety}.


  • To eliminate the possibility of these anomalies, we must place the facts about varieties offered into a different table from the facts about delivery areas, yielding two tables that are both in 4NF:


Varieties By Restaurant
Restaurant Pizza Variety
A1 Pizza Thick Crust
A1 Pizza Stuffed Crust
Elite Pizza Thin Crust
Elite Pizza Stuffed Crust
Vincenzo's Pizza Thick Crust
Vincenzo's Pizza Thin Crust


Delivery Areas By Restaurant
Restaurant Delivery Area
A1 Pizza Springfield
A1 Pizza Shelbyville
A1 Pizza Capital City
Elite Pizza Capital City
Vincenzo's Pizza Springfield
Vincenzo's Pizza Shelbyville



In contrast, if the pizza varieties offered by a restaurant sometimes did legitimately vary from one delivery area to another, the original three-column table would satisfy 4NF.



Fifth Normal Form

There is one more form of normalization which is sometimes applied, but it is indeed very esoteric and is in most cases probably not required to get the most functionality out of your data structure or application.

Its tenet suggests:

  • The original table must be reconstructed from the tables into which it has been broken down.
  • The benefit of applying this rule ensures you have not created any extraneous columns in your tables, and that all of the table structures you have created are only as large as they need to be.


It's good practice to apply this rule, but unless you're dealing with a very large data schema you probably won't need it.




No comments: