What is Normalization?

Normalization is the process of efficiently organizing data in a database.

There are two goals of the normalization process:

eliminating redundant data (for example, storing the same data in more than one table)
ensuring data dependencies make sense (only storing related data in a table).

Both of these are worthy goals as they reduce the amount of space a database consumes and ensure that data is logically stored.

The concept of database normalization is not unique to any particular Relational
Database Management System. It can be applied to any of several implications
of relational databases including Microsoft Access, dBase, Oracle, etc. The
benefits of Normalizing your database include:

Avoiding repetitive entries
Reducing required storage space
Preventing the need to restructure existing tables to accommodate new data.
Increased speed and flexibility of queries, sorts, and summaries.

The Normal Forms

The database community has developed a series of guidelines for ensuring that databases are normalized. These are referred to as normal forms and are numbered from one (the lowest form of normalization, referred to as first normal form or 1NF) through five (fifth normal form or 5NF).

In practical applications, you'll often see 1NF, 2NF, and 3NF along with the occasional 4NF. Fifth normal form is very rarely seen

In order to reach peak efficiency, it is recommended that relational databases be normalized through at least the third normal form.
In order to normalize a database, each table should have a primary key field that uniquely identifies each record in that table. A primary key can consist of a single field (an ID Number field for instance) or a combination of two or more fields that together make a unique key (called a multiple field primary key).

First Normal Form (1NF)

First normal form (1NF) sets the very basic rules for an organized database:

Eliminate duplicative columns from the same table.
Create separate tables for each group of related data and identify each row with a unique column or set of columns (the primary key).

For a table to be in first normal form, data must be broken up into the smallest units possible.

For example, the following table is not in first normal form.

Name	Address	Phone
Sally Singer	123 Broadway New York, NY, 11234	(111) 222-3345
Jason Jumper	456 Jolly Jumper St. Trenton NJ, 11547	(222) 334-5566

To conform to first normal form, this table would require additional fields.

The name field should be divided into first and last name and the address should

be divided by street, city state, and zip like this.

ID	First	Last	Street	City	State	Zip	Phone
564	Sally	Singer	123 Broadway	New York	NY	11234	(111) 222-3345
565	Jason	Jumper	456 Jolly Jumper St.	Trenton	NJ	11547	(222) 334-5566

In addition to breaking data up into the smallest meaningful values, tables

in first normal form should not contain repetitions groups of fields such as
in the following table.

Rep ID	Representative	Client 1	Time 1	Client 2	Time 2	Client 3	Time 3
TS-89	Gilroy Gladstone	US Corp.	14 hrs	Taggarts	26 hrs	Kilroy Inc.	9 hrs
RK-56	Mary Mayhem	Italiana	67 hrs	Linkers	2 hrs

The problem here is that each representative can have multiple clients not all will have three. Some may have less as is the case in the second record, tying up storage space in your database that is not being used, and some may have more, in which case there are not enough fields. The solution to this is to add a record for each new piece of information.

Rep ID	Rep First Name	Rep Last Name	Client	Time With Client
TS-89	Gilroy	Gladstone	US Corp	14 hrs
TS-89	Gilroy	Gladstone	Taggarts	26 hrs
TS-89	Gilroy	Gladstone	Kilroy Inc.	9 hrs
RK-56	Mary	Mayhem	Italiana	67 hrs
RK-56	Mary	Mayhem	Linkers	2 hrs

*Notice the splitting of the first and last name fields again.

This table is now in first normal form. Note that by avoiding repeating groups of fields, we have created a new problem in that there are identical values in the primary key field, violating the rules of the primary key. In order to remedy this, we need to have some other way of identifying each record. This can be done with the creation of a new key called client ID.

Rep ID*	Rep First Name	Rep Last Name	Client ID*	Client	Time With Client
TS-89	Gilroy	Gladstone	978	US Corp	14 hrs
TS-89	Gilroy	Gladstone	665	Taggarts	26 hrs
TS-89	Gilroy	Gladstone	782	Kilroy Inc.	9 hrs
RK-56	Mary	Mayhem	221	Italiana	67 hrs
RK-56	Mary	Mayhem	982	Linkers	2 hrs

This new field can now be used in conjunction with the Rep ID field to create a multiple field primary key. This will prevent confusion if ever more than one Representative were to serve a single client.

Second Normal Form (2NF)

Second normal form (2NF) further addresses the concept of removing duplication data:

Meet all the requirements of the first normal form.
Remove subsets of data that apply to multiple rows of a table and place them in separate tables.
Create relationships between these new tables and their predecessors through the use of foreign keys.

The second normal form applies only to tables with multiple field primarykeys. Take the following table for example.

Rep ID*	Rep First Name	Rep Last Name	Client ID*	Client	Time With Client
TS-89	Gilroy	Gladstone	978	US Corp	14 hrs
TS-89	Gilroy	Gladstone	665	Taggarts	26 hrs
TS-89	Gilroy	Gladstone	782	Kilroy Inc.	9 hrs
RK-56	Mary	Mayhem	221	Italiana	67 hrs
RK-56	Mary	Mayhem	982	Linkers	2 hrs
RK-56	Mary	Mayhem	665	Taggarts	4 hrs

This table is already in first normal form. It has a primary keyconsisting of Rep ID and Client ID since neither alone can be considered a
unique value.

The second normal form states that each field in a multiple field primary key table must be directly related to the entire primary key. Or in other words,each non-key field should be a fact about all the fields in the primary key.

Only fields that are absolutely necessary should show up in our table, all other fields should reside in different tables.

In order to find out which fields are necessary we should ask a few questions of our database.

In our preceding example, I should ask the question "What information is this table meant to store?" Currently, the answer is not obvious. It may bemeant to store information about individual clients, or it could be

holding data for employees time cards.

As a further example, if my database is going to contain records of employees I may want a table of
demographics and a table for payroll. The demographics will have all the employees personal information and will assign them an ID number. I should not have to enter the data twice, the payroll table on the other hand should refer to each employee only by their ID number. I can then link the two tables by a relationship and will then have access to all the necessary data.

In the table of the preceding example we are devoting three field to the identification of the employee and two to the identification of the client. I could identify them with only one field each -- the primary

key. I can then take out the extraneous fields and put them in their own
table. For example, my database would then look like the following.

Rep ID*	Client ID*	Time With Client
TS-89	978	14 hrs
TS-89	665	26 hrs
TS-89	782	9 hrs
RK-56	221	67 hrs
RK-56	982	2 hrs
RK-56	665	4 hrs

The above table contains time card information.

Rep ID*	First Name	Last Name
TS-89	Gilroy	Gladstone
RK-56	Mary	Mayhem

The above table contains Employee Information.

Client ID*	Client Name
978	US Corp
665	Taggarts
782	Kilroy Inc.
221	Italiana
982	Linkers

The above table contains Client Information

These tables are now in normal form. By splitting off the unnecessary information and putting it in its own tables, we have eliminated redundancy and put our first table in second normal form. These tables are now ready to be linked through relationship to each other.

Third Normal Form (3NF)

There are two basic requirements for a database to be in third normal form:

Already meet the requirements of both 1NF and 2NF
Remove columns that are not fully dependent upon the primary key.

Imagine that we have a table of widget orders that contains the following attributes:

Order Number
Customer Number
Unit Price
Quantity
Total

Order Number	Customer Number	Unit Price	Quantity	Total

Remember, our first requirement is that the table must satisfy the requirements of 1NF and 2NF.

Are there any duplicative columns? No.
Do we have a primary key? Yes, the order number.

Therefore, we satisfy the requirements of 1NF.

Are there any subsets of data that apply to multiple rows? No, so we also satisfy the requirements of 2NF.

Now, are all of the columns fully dependent upon the primary key?

The customer number varies with the order number and it doesn't appear to depend upon any of the other fields.

What about the unit price? This field could be dependent upon the customer number in a situation where we charged each customer a set price. However, looking at the data above, it appears we sometimes charge the same customer different prices. Therefore, the unit price is fully dependent upon the order number. The quantity of items also varies from order to order, so we're OK there.

What about the total? It looks like we might be in trouble here. The total can be derived by multiplying the unit price by the quantity, therefore it's not fully dependent upon the primary key. We must remove it from the table to comply with the third normal form. Perhaps we use the following attributes:

Order Number
Customer Number
Unit Price
Quantity

Now our table is in 3NF. But, you might ask, what about the total? This is a derived field and it's best not to store it in the database at all. We can simply compute it "on the fly" when performing database queries.

Fourth Normal Form (4NF)

Finally, fourth normal form (4NF) has one additional requirement:

Meet all the requirements of the third normal form.
A relation is in 4NF if it has no multi-valued dependencies.

Remember, these normalization guidelines are cumulative. For a database to be in 2NF, it must first fulfill all the criteria of a 1NF database.

In a many-to-many relationship, independent entities can not be stored in the same table.

Since it only applies to the many-to-many relationship, most developers can rightfully ignore this rule. But it does come in handy in certain situations, such as this one.

Consider the following example:

Pizza Delivery Permutations

Restaurant	Pizza Variety	Delivery Area
A1 Pizza	Thick Crust	Springfield
A1 Pizza	Thick Crust	Shelbyville
A1 Pizza	Thick Crust	Capital City
A1 Pizza	Stuffed Crust	Springfield
A1 Pizza	Stuffed Crust	Shelbyville
A1 Pizza	Stuffed Crust	Capital City
Elite Pizza	Thin Crust	Capital City
Elite Pizza	Stuffed Crust	Capital City
Vincenzo's Pizza	Thick Crust	Springfield
Vincenzo's Pizza	Thick Crust	Shelbyville
Vincenzo's Pizza	Thin Crust	Springfield
Vincenzo's Pizza	Thin Crust	Shelbyville

Each row indicates that a given restaurant can deliver a given variety of pizza to a given area.

The table has no non-key attributes because its only key is {Restaurant, Pizza Variety, Delivery Area}. Therefore it meets all normal forms up to BCNF.

It does not, however, meet 4NF. The problem is that the table features two non-trivial multivalued dependencies on the {Restaurant} attribute (which is not a superkey). The dependencies are:

{Restaurant} →→ {Pizza Variety}
{Restaurant} →→ {Delivery Area}

These non-trivial multivalued dependencies on a non-superkey reflect the fact that the varieties of pizza a restaurant offers are independent from the areas to which the restaurant delivers. This state of affairs leads to redundancy in the table: for example, we are told three times that A1 Pizza offers Stuffed Crust, and if A1 Pizza start producing Cheese Crust pizzas then we will need to add multiple rows, one for each of A1 Pizza's delivery areas. There is, moreover, nothing to prevent us from doing this incorrectly: we might add Cheese Crust rows for all but one of A1 Pizza's delivery areas, thereby failing to respect the multivalued dependency {Restaurant} →→ {Pizza Variety}.

To eliminate the possibility of these anomalies, we must place the facts about varieties offered into a different table from the facts about delivery areas, yielding two tables that are both in 4NF:

Varieties By Restaurant

Restaurant	Pizza Variety
A1 Pizza	Thick Crust
A1 Pizza	Stuffed Crust
Elite Pizza	Thin Crust
Elite Pizza	Stuffed Crust
Vincenzo's Pizza	Thick Crust
Vincenzo's Pizza	Thin Crust

Delivery Areas By Restaurant

Restaurant	Delivery Area
A1 Pizza	Springfield
A1 Pizza	Shelbyville
A1 Pizza	Capital City
Elite Pizza	Capital City
Vincenzo's Pizza	Springfield
Vincenzo's Pizza	Shelbyville

In contrast, if the pizza varieties offered by a restaurant sometimes did legitimately vary from one delivery area to another, the original three-column table would satisfy 4NF.

Fifth Normal Form

There is one more form of normalization which is sometimes applied, but it is indeed very esoteric and is in most cases probably not required to get the most functionality out of your data structure or application.

Its tenet suggests:

The original table must be reconstructed from the tables into which it has been broken down.
The benefit of applying this rule ensures you have not created any extraneous columns in your tables, and that all of the table structures you have created are only as large as they need to be.

It's good practice to apply this rule, but unless you're dealing with a very large data schema you probably won't need it.

anoops-techz

Thursday, April 16, 2009

Database Normalization

Third Normal Form (3NF)

No comments: