What is Normalization?
- eliminating redundant data (for example, storing the same data in more than one table)
- ensuring data dependencies make sense (only storing related data in a table).
Both of these are worthy goals as they reduce the amount of space a database consumes and ensure that data is logically stored.
The concept of database normalization is not unique to any particular Relational
Database Management System. It can be applied to any of several implications
of relational databases including Microsoft Access, dBase, Oracle, etc. The
benefits of Normalizing your database include:
- Avoiding repetitive entries
- Reducing required storage space
- Preventing the need to restructure existing tables to accommodate new data.
- Increased speed and flexibility of queries, sorts, and summaries.
The Normal Forms
In practical applications, you'll often see 1NF, 2NF, and 3NF along with the occasional 4NF. Fifth normal form is very rarely seen
- In order to reach peak efficiency, it is recommended that relational databases be normalized through at least the third normal form.
- In order to normalize a database, each table should have a primary key field that uniquely identifies each record in that table. A primary key can consist of a single field (an ID Number field for instance) or a combination of two or more fields that together make a unique key (called a multiple field primary key).
First Normal Form (1NF)
First normal form (1NF) sets the very basic rules for an organized database:
- Eliminate duplicative columns from the same table.
- Create separate tables for each group of related data and identify each row with a unique column or set of columns (the primary key).
For a table to be in first normal form, data must be broken up into the smallest units possible.
For example, the following table is not in first normal form.
Name | Address | Phone |
Sally Singer | 123 Broadway New York, NY, 11234 | (111) 222-3345 |
Jason Jumper | 456 Jolly Jumper St. Trenton NJ, 11547 | (222) 334-5566 |
To conform to first normal form, this table would require additional fields.
- The name field should be divided into first and last name and the address should
be divided by street, city state, and zip like this.
ID | First | Last | Street | City | State | Zip | Phone |
564 | Sally | Singer | 123 Broadway | New York | NY | 11234 | (111) 222-3345 |
565 | Jason | Jumper | 456 Jolly Jumper St. | Trenton | NJ | 11547 | (222) 334-5566 |
- In addition to breaking data up into the smallest meaningful values, tables
in the following table.
Rep ID | Representative | Client 1 | Time 1 | Client 2 | Time 2 | Client 3 | Time 3 |
TS-89 | Gilroy Gladstone | US Corp. | 14 hrs | Taggarts | 26 hrs | Kilroy Inc. | 9 hrs |
RK-56 | Mary Mayhem | Italiana | 67 hrs | Linkers | 2 hrs |
The problem here is that each representative can have multiple clients not all will have three. Some may have less as is the case in the second record, tying up storage space in your database that is not being used, and some may have more, in which case there are not enough fields. The solution to this is to add a record for each new piece of information.
Rep ID | Rep First Name | Rep Last Name | Client | Time With Client |
TS-89 | Gilroy | Gladstone | US Corp | 14 hrs |
TS-89 | Gilroy | Gladstone | Taggarts | 26 hrs |
TS-89 | Gilroy | Gladstone | Kilroy Inc. | 9 hrs |
RK-56 | Mary | Mayhem | Italiana | 67 hrs |
RK-56 | Mary | Mayhem | Linkers | 2 hrs |
*Notice the splitting of the first and last name fields again.
This table is now in first normal form. Note that by avoiding repeating groups of fields, we have created a new problem in that there are identical values in the primary key field, violating the rules of the primary key. In order to remedy this, we need to have some other way of identifying each record. This can be done with the creation of a new key called client ID.
Rep ID* | Rep First Name | Rep Last Name | Client ID* | Client | Time With Client |
TS-89 | Gilroy | Gladstone | 978 | US Corp | 14 hrs |
TS-89 | Gilroy | Gladstone | 665 | Taggarts | 26 hrs |
TS-89 | Gilroy | Gladstone | 782 | Kilroy Inc. | 9 hrs |
RK-56 | Mary | Mayhem | 221 | Italiana | 67 hrs |
RK-56 | Mary | Mayhem | 982 | Linkers | 2 hrs |
This new field can now be used in conjunction with the Rep ID field to create a multiple field primary key. This will prevent confusion if ever more than one Representative were to serve a single client.
Second Normal Form (2NF)
- Meet all the requirements of the first normal form.
- Remove subsets of data that apply to multiple rows of a table and place them in separate tables.
- Create relationships between these new tables and their predecessors through the use of foreign keys.
The second normal form applies only to tables with multiple field primarykeys. Take the following table for example.
Rep ID* | Rep First Name | Rep Last Name | Client ID* | Client | Time With Client |
TS-89 | Gilroy | Gladstone | 978 | US Corp | 14 hrs |
TS-89 | Gilroy | Gladstone | 665 | Taggarts | 26 hrs |
TS-89 | Gilroy | Gladstone | 782 | Kilroy Inc. | 9 hrs |
RK-56 | Mary | Mayhem | 221 | Italiana | 67 hrs |
RK-56 | Mary | Mayhem | 982 | Linkers | 2 hrs |
RK-56 | Mary | Mayhem | 665 | Taggarts | 4 hrs |
This table is already in first normal form. It has a primary keyconsisting of Rep ID and Client ID since neither alone can be considered a
unique value.
- The second normal form states that each field in a multiple field primary key table must be directly related to the entire primary key. Or in other words,each non-key field should be a fact about all the fields in the primary key.
- Only fields that are absolutely necessary should show up in our table, all other fields should reside in different tables.
- In order to find out which fields are necessary we should ask a few questions of our database.
holding data for employees time cards.
As a further example, if my database is going to contain records of employees I may want a table of
demographics and a table for payroll. The demographics will have all the employees personal information and will assign them an ID number. I should not have to enter the data twice, the payroll table on the other hand should refer to each employee only by their ID number. I can then link the two tables by a relationship and will then have access to all the necessary data.
- In the table of the preceding example we are devoting three field to the identification of the employee and two to the identification of the client. I could identify them with only one field each -- the primary
key. I can then take out the extraneous fields and put them in their own
table. For example, my database would then look like the following.
Rep ID* | Client ID* | Time With Client |
TS-89 | 978 | 14 hrs |
TS-89 | 665 | 26 hrs |
TS-89 | 782 | 9 hrs |
RK-56 | 221 | 67 hrs |
RK-56 | 982 | 2 hrs |
RK-56 | 665 | 4 hrs |
The above table contains time card information.
Rep ID* | First Name | Last Name |
TS-89 | Gilroy | Gladstone |
RK-56 | Mary | Mayhem |
The above table contains Employee Information.
Client ID* | Client Name |
978 | US Corp |
665 | Taggarts |
782 | Kilroy Inc. |
221 | Italiana |
982 | Linkers |
The above table contains Client Information
These tables are now in normal form. By splitting off the unnecessary information and putting it in its own tables, we have eliminated redundancy and put our first table in second normal form. These tables are now ready to be linked through relationship to each other.
Third Normal Form (3NF)
There are two basic requirements for a database to be in third normal form:
- Already meet the requirements of both 1NF and 2NF
- Remove columns that are not fully dependent upon the primary key.
Imagine that we have a table of widget orders that contains the following attributes:
- Order Number
- Customer Number
- Unit Price
- Quantity
- Total
Order Number | Customer Number | Unit Price | Quantity | Total |
- Remember, our first requirement is that the table must satisfy the requirements of 1NF and 2NF.
- Are there any duplicative columns? No.
- Do we have a primary key? Yes, the order number.
- Are there any subsets of data that apply to multiple rows? No, so we also satisfy the requirements of 2NF.
- Now, are all of the columns fully dependent upon the primary key?
What about the unit price? This field could be dependent upon the customer number in a situation where we charged each customer a set price. However, looking at the data above, it appears we sometimes charge the same customer different prices. Therefore, the unit price is fully dependent upon the order number. The quantity of items also varies from order to order, so we're OK there.
What about the total? It looks like we might be in trouble here. The total can be derived by multiplying the unit price by the quantity, therefore it's not fully dependent upon the primary key. We must remove it from the table to comply with the third normal form. Perhaps we use the following attributes:
- Order Number
- Customer Number
- Unit Price
- Quantity
Fourth Normal Form (4NF)
- Meet all the requirements of the third normal form.
- A relation is in 4NF if it has no multi-valued dependencies.
- In a many-to-many relationship, independent entities can not be stored in the same table.
Consider the following example:
Restaurant | Pizza Variety | Delivery Area |
---|---|---|
A1 Pizza | Thick Crust | Springfield |
A1 Pizza | Thick Crust | Shelbyville |
A1 Pizza | Thick Crust | Capital City |
A1 Pizza | Stuffed Crust | Springfield |
A1 Pizza | Stuffed Crust | Shelbyville |
A1 Pizza | Stuffed Crust | Capital City |
Elite Pizza | Thin Crust | Capital City |
Elite Pizza | Stuffed Crust | Capital City |
Vincenzo's Pizza | Thick Crust | Springfield |
Vincenzo's Pizza | Thick Crust | Shelbyville |
Vincenzo's Pizza | Thin Crust | Springfield |
Vincenzo's Pizza | Thin Crust | Shelbyville |
- Each row indicates that a given restaurant can deliver a given variety of pizza to a given area.
- The table has no non-key attributes because its only key is {Restaurant, Pizza Variety, Delivery Area}. Therefore it meets all normal forms up to BCNF.
- It does not, however, meet 4NF. The problem is that the table features two non-trivial multivalued dependencies on the {Restaurant} attribute (which is not a superkey). The dependencies are:
- {Restaurant} →→ {Pizza Variety}
- {Restaurant} →→ {Delivery Area}
- These non-trivial multivalued dependencies on a non-superkey reflect the fact that the varieties of pizza a restaurant offers are independent from the areas to which the restaurant delivers. This state of affairs leads to redundancy in the table: for example, we are told three times that A1 Pizza offers Stuffed Crust, and if A1 Pizza start producing Cheese Crust pizzas then we will need to add multiple rows, one for each of A1 Pizza's delivery areas. There is, moreover, nothing to prevent us from doing this incorrectly: we might add Cheese Crust rows for all but one of A1 Pizza's delivery areas, thereby failing to respect the multivalued dependency {Restaurant} →→ {Pizza Variety}.
- To eliminate the possibility of these anomalies, we must place the facts about varieties offered into a different table from the facts about delivery areas, yielding two tables that are both in 4NF:
Restaurant | Pizza Variety |
---|---|
A1 Pizza | Thick Crust |
A1 Pizza | Stuffed Crust |
Elite Pizza | Thin Crust |
Elite Pizza | Stuffed Crust |
Vincenzo's Pizza | Thick Crust |
Vincenzo's Pizza | Thin Crust |
Restaurant | Delivery Area |
---|---|
A1 Pizza | Springfield |
A1 Pizza | Shelbyville |
A1 Pizza | Capital City |
Elite Pizza | Capital City |
Vincenzo's Pizza | Springfield |
Vincenzo's Pizza | Shelbyville |
In contrast, if the pizza varieties offered by a restaurant sometimes did legitimately vary from one delivery area to another, the original three-column table would satisfy 4NF.
Fifth Normal Form
There is one more form of normalization which is sometimes applied, but it is indeed very esoteric and is in most cases probably not required to get the most functionality out of your data structure or application.
Its tenet suggests:
- The original table must be reconstructed from the tables into which it has been broken down.
- The benefit of applying this rule ensures you have not created any extraneous columns in your tables, and that all of the table structures you have created are only as large as they need to be.
It's good practice to apply this rule, but unless you're dealing with a very large data schema you probably won't need it.
No comments:
Post a Comment