The 1-2-3 of Data Normalization
Abstract
In the 1960s and 1970s, Dr Edgar Codd, a British computer scientist, developed a means of reducing data duplication with in relational databases. While six “forms” of the process exist, 1-4 ought to be at every database application developer’s finger tips. Fifty, nearly sixty, years have elapsed. Not only do these guidelines still apply today, when we forget them, we can cripple the flexibility and dynamic ability for software to grow and adapt.
Number Zero
The Number Zero of the data normalization forms is called: Unnormalized. While I would love to joke about this structure of data, we all encounter it just too often. Therefore, let’s acknowledge that it exists.
Movin’ on!
Data Table (a requisite element)
To normalize data, one must be working within a relational database environment, such as Oracle and such. The table must have zero, one, or more rows/records. Each row/record must have a primary key. The primary key must be unique. Quick checklist:
- Data stored in tables
- Tables have 1 or more columns (fields)
- Tables have 0, 1, or more rows (records)
- Each row has a unique primary key that uniquely identifies one row
- Each row of data must relate to 1 “item” (topic/element/etc)
My Short Hand
- Model the world as it is, not as you want it to be. The world is messy. It is ok to use complex data structures to model complex data. “Complicated things are complicated”
- Think first: then add tables when needed; add fields when needed
- Look for lateral duplication
- Look for vertical duplication
- Look for non-data data (messy/junky stuff)
Lateral Duplication
Do not add a field called Contact2, ever.
Do not add a field called email2, ever.
Do not add a field called Address2, ever.
If you have multiple contacts, add a table for the contacts. One row for each contact.
If you find that your tables have columns duplicating sideways (laterally), then create a new table! Hey people have multiple email addresses. Create a table with a primary key and 1 email address. Validate the quality of the email to the best of the applications ability. Each new email address is a new row. Done.
This is a casual, but functional means of defining the First Normal Form.
Vertical Duplication
Look at your tables. Do you have important data that is duplicated by users going vertically in a table? You can see this if you have multiple clients called Acme Anvil. If you have five contact people named Wiley E Coyote, then you have an opportunity to improve the data model.
The solution tends to be new tables and breaking the data into smaller building blocks. When done well you work towards both the Second and Third Normal Forms.
Non-Data Data
When we ask users to type data free formed into fields, we tend to get junk. They enter “data” that is not-really-data. I am not talking about client names or personal names but all of those reason fields and note fields. That free-formed information. We mistype; we abbreviate; we get verbose; we get terse. In short, we get junk. Unless we use artificial intelligence or wicked complicate code, we lose the value of what the user is trying to tell us! Junk is junk!
If data matter to us, then we ought to use lookup tables and codes so that we capture what users need to track. It takes listening and tracking what they are doing. That is a great reason to talk with users and talk with your client. It is their data. We like good data. Good data has value. Bad data is bad.
First Normal Form (1NF)
To satisfy the First Normal Form of relational databases, then:
- Each column(field) of a table must have a “single value”. This maybe a primary key, a foreign key, a client name, etc. When we store compound data within a field, we’ve stepped off the path. Sorry JSON.
How do we break this rule?
Let’s imagine a table with client data. The classic table has the unique primary key and the client name. Imagine needing an address. Let’s add an address
column to the table. Into the column, we type the address: Spui 70 (boĆ®te postale 12600),2500 DJ La Haye, Nederland (That’s the Hague). Suddenly, I need another address for the same location. So I add in a delivery address for heavy trucks. If I put a second address into the ADDRESS
field, then I have eliminated the usefulness of the Address field. And I have not followed the First Normalization Form.
Another example with the client table, imagine a field for the email address of your contact person there. Store one address: compliant. Store two in that field and your are not compliant. Game over.
fred@flintstone.fake
is fair
fred@flintstone.fake, barney@flintstone.fake
in the same field breaks the rules.
Why is this bad?
Imagine that you need to email the client and you have two email addresses in the same field called EMAIL
. If you send an email with one vendor, they want commas to separate. If you use the list with a different email vendor, they want a semi-colon between the two email address. You’ve lost control of the formatting, validity and the usefulness of the email address. It is useless data (unless you write code that parses it out.
Same with printing a mailing label for the Hague address above. You want precisely and exactly one address on a shipping label. If you ADDRESS
field has two, then you are sunk. You lost the data game. No points for you.
Second Normal Form (2NF)
To satisfy the Second Normal Form (2NF) then:
- Each row must have a unique primary key.
- Rows ought describe unique data sets
Sure, yes, that is in the requisite steps listed above. Sometimes, we create data tables that require applications to always use multiple keys to find a desired row. In this situation, we use multiple or compound keys to access data.
Violating this rule happens all of the time. Can you envision the need to store a client in the client table that may have two division at two locations? Some people would create two separate clients each with one address. But are they the same client or one? We now have the name duplicated. In this row, we have ACME ANVILS
Primary Key | Client Name | Address | |
11235 | Acme Anvil | 123 Roadrunner Ave, WB | |
2358 | Acme Anvil | 123 Wiley Ln, WB |
We do not know if this is one client with two addresses or two clients. We failed to model the complexity of the real-world data. We must solve problems with multiple divisions and multiple addresses. The solution tends to involve adding tables that have greater specificity.
We are creating redundancy in the data going vertically. Which Acme Anvil do we need? How do we identify the one we want?
Why is this bad?
You want to place your finger on the required data elements with ease. If you don’t know which is which, then how will the application?
How to Fix?
Typically, we need to add fields and tables to more completely and more robustly model the data as it is, not as we want it to be. Yes, the real world is messy. Clients have divisions, multiple offices, multiple addresses, multiple contact people. Roll the sleeve up, figure out the data needed and how to relate it.
Do not be afraid view the data world as small data blocks that snap together like Lego bricks.
In this case, maybe we have a table for divisions. We create a table for addresses. Maybe we get super clever and add a client foreign key that references the primary key in its own table. This self-reference creates a tree structure or a hierarchical query. Super powerful for modelling data that nests within itself. With this technique, a client becomes a child of another client: which can represent a division within a client.
Third Normal Form (3NF)
Data typed by hand is:
- Flawed/error-prone
- In consistent
To satisfy the 3rd Normal Form:
- Use Lookup Tables and lists to maintain consistency within the data
Often, when we are asking users to classify or identify the qualities of data, we are asking them to type something. We had a situation today, when a user knew that a client would have a tax ID, but did not yet have a tax ID. So the user typed: 0000.0000.0000 – which meant, “I put in a place holder”. It is not a valid tax id. The team failed to model this event in the lifecycle of client data. We use databases to organize and query data, often massive amounts of data. Consistency matters. Maybe there ought to be a field that says: Tax ID pending. Or maybe there is a lifecycle for the tax ID.
- Client does not have a tax ID and does not need one
- Client applied for a tax ID and it is pending
- Client has a tax ID and it typed in
- The system validated the tax ID against an external authority.
Now we have four states of the tax ID in addition to the tax ID number. We get more information.
To satisfy the Third Normalization Form, we should create a table for Tax ID status:
- No Tax ID Required (
NONE
) - Tax ID Pending (
PENDING
) - Tax ID entered (
EXISTS
) - Tax ID validated with external authority (
VALID
)
We then use a look up table in conjunction with the Tax ID field to provide a more complete understanding of the data we are capturing. if a Tax ID is empty, we would have to guess. Is the Tax ID empty because the client does not need one? Or is it empty because the Tax ID is pending? If guessing, you’ll be right some times! 50-50, right?
When data become dependent on each other, we add tables and lookups to enforce consistency within the data set.
Had we asked users to type a reason for the absence of the Tax ID, we would see all sort of junk! We turned an opportunity to improve the quality of data into a mess. If you user selects from lists, then we are working harder to satisfy the Third Normal Form. My short hand version of this sounds like this: Use Look up tables to validate data.
Other Normal Forms
Data get complicated because, friends, our world is complicated. We can handle tables with multiple data dependencies. Normal Forms four, five and six to exist. And while I barely understand the terminology used in the academic literature and even on-line. Following these forms lead us into gloriously interesting structures. Child tables that matrix to multiple parent tables. Once playing in these deeper waters, peers tend to sit and discuss how to work through these issues.
Conclusion
When making decisions, we always make trade-offs. We violate the normal forms both accidently and on-purpose. In my podcast, The Soul of an Internet Machine, I discuss data normalization. In the late 1960s, people were not writing software that ran globally supporting 100 spoken languages across two-dozen time zone circling the earth. We did not create a means of supporting data in multiple languages simultaneously. If we duplicate the data laterally for English, French, Dutch, German, and Italian, we violate normal forms. If we create new rows, we may also violate normal forms (depends on the structure). If we store the phrases within JSON data inside of a field, we again violated the rules. So we plan and discuss.
When we add fields such as contact1, contact2, contact3, contact4
or even email1, email2, email3, email4
then we’ve created structural flaws in our applications from which we may never recover. Our application have a limited useful life – until we bring it closer to the Normal Forms. But strict adherence may yield other problems. I get as close as I can to the Normal Forms, then I acknowledge… maybe not.