The Shape of Data

I can picture data; for me, a table is typically cuboidal.

On the left side of the face are columns and columns of keys. The first column is the primary key. The second column is the most important foreign key. We have document storage tables with 30 foreign keys – one key for each shape that the data takes. Then we have various columns for metadata and interesting stuff. Somewhere on the far left are the numbers. At the very far right of the face of my cube are the fundamental audit columns answering who created or updated the data row, and when.

The data is dimensional. Even inter-dimensional. It is visual to me. When looking at the document table, I see only the invoice and customer data when I am studying invoices. Then, if the task relates to inventory, different columns stand in the mental foreground. This is more obvious when creating an entity relationship diagram for a mature and robust database system that has been fully normalized and has like data consolidated with like data. For example, regardless of the nature of a document, put all documents in one table. Not necessarily guidance for others, but it is our best practice. “Like with Like” is a good database rule. It is the first step in normalizing data. Put your client profile data together, but don’t fill it up with phone numbers and email addresses. If you have a lot of phone numbers, then clearly, to place “like with like,” you need a phone number table – or an address table, or an email table.

There is a process of identifying and recognizing the shape of data when talking with clients.

Invoice data looks a certain way, and it has for decades. At a minimum it requires two tables. There is a table that represents the header and footer seen on paper documents. This invoice is for someone, valued at some amount, issued on some day, due on some other day. The taxes are X and the discount for paying on time is Y. Each line item normally goes into a second table that tracks the specifics of what is being invoiced: line number, quantity, product, description, unit price, extended price, taxes, and total.
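
A minimal sketch of that two-table shape in Oracle SQL. The table and column names here are illustrative, not anyone’s production schema:

    -- Header/footer of the paper document
    CREATE TABLE invoices (
        invoice_id         NUMBER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        customer_id        NUMBER NOT NULL,   -- who the invoice is for
        invoice_date       DATE NOT NULL,     -- issued on some day
        due_date           DATE,              -- due on some other day
        tax_amount         NUMBER(12,2),      -- the taxes are X
        early_pay_discount NUMBER(12,2),      -- the discount for paying on time is Y
        total_amount       NUMBER(12,2)
    );

    -- One row per line item
    CREATE TABLE invoice_lines (
        invoice_line_id NUMBER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        invoice_id      NUMBER NOT NULL REFERENCES invoices (invoice_id),
        line_number     NUMBER NOT NULL,
        quantity        NUMBER,
        product_id      NUMBER,
        description     VARCHAR2(400),
        unit_price      NUMBER(12,2),
        extended_price  NUMBER(12,2),
        tax_amount      NUMBER(12,2),
        line_total      NUMBER(12,2)
    );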

Like Plato and his ideal plane, these abstracted forms exist within the mind’s eye. When we, database developers, listen to civilian clients we compare their expectations with the ideal data model for each structure: client, contact people, concert ticket, airline ticket, inventory part, tractor repair service call (which is just like a car service).

In the recent quarter, my team encountered a request (demand) from the client that read: Create this table for these data. This is a temporary solution that we will shortly replace.

I was out of the office that day. I didn’t read the ticket and I wasn’t in a meeting. My team said: “That’s stupid. The client is wrong.” By the end of the week, the client was furious. “You didn’t build me the table I told you to make. FAIL.” And the ticket returned to the to-do column.

By the way, I skimmed the ticket as well. All of the features the client wanted were completed except for this one table. All of the visual work and functionality existed and worked perfectly.

The table was to be a bizarro intersection between service data, document data, and a mixed bag of other information. No sense of “like with like,” but hey, we accomplished the mission without some nonsensical, misaligned, never-planned table that was to be “temporary” in a massive enterprise application that we’ve been building for years.

I was happy with the work.

Until the day the sprint ended.

Client said: “I need this table.”

“Oh, why is that? We met all the requirements.”

“Because we are uploading thousands of rows of data from an external system using that table next week.”

A courtroom lawyer would state that those were facts not yet in evidence. I lost my temper because one does on the last day of a sprint when new and urgent requirements are invented at 7 in the morning.

I should never lose my temper. I must always be a database developer and an analyst.

“What is this for?”

“What is the shape of the data we are bringing in?”

“How soon do you need this?”

For the client, the answers were correct and perfect. For me, the answers were wrong and out of sync.

I reviewed our tables and sitting in Plato’s ideal plane was the exact table the client needed. Wait, I am wrong, it wasn’t in Plato’s ideal plane, it was in our existing table structure. At the conception of the project, we drafted robust tables to meet 90% of the client’s needs based on the experience of building related systems.

We were asked to create a table that bridged the data between a field service order and the resulting report created by the technician. Here’s the report data. Here are the service order details. We recognized years ago that we needed a table that matrixed and related a service order to a report result to a report document (PDF). We knew it needed to be authored and signed by a technician and linked with precisely one service order. It needed a pass/fail score.

It was an intuitive, existing, not-a-surprise requirement, given that it had been there since the first draft of our data table structure. It existed not because we had sample data or because the client stated it as a requirement, but because to develop a complete and robust data model, this piece needed to exist.

An experienced eye can look at the data and say: this service order report table needs to exist because its data are unlike other data. These data are more like each other, and they don’t belong other places. Oh, and these data need hooks (keys) to all sorts of related data such as client, service order, service order line, employee, the report, and the PDF document.
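
A hedged sketch of what such a bridging table might look like. The names are illustrative, and the related keys are shown as comments rather than foreign key constraints because their parent tables live elsewhere in the schema:

    CREATE TABLE service_order_reports (
        service_order_report_id NUMBER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        service_order_id        NUMBER NOT NULL,  -- key: precisely one service order
        service_order_line_id   NUMBER,           -- key: the line being reported on
        client_id               NUMBER NOT NULL,  -- key: the client
        technician_id           NUMBER NOT NULL,  -- key: authored and signed by a technician
        pdf_document_id         NUMBER,           -- key: the rendered report (PDF)
        pass_fail               VARCHAR2(4)
                                CHECK (pass_fail IN ('PASS', 'FAIL')),
        signed_on               DATE
    );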

The developer tasked with planning and then importing these new data didn’t understand our structure, or he didn’t research it. The client, listening to one development team, told another development team to execute on a specific task while obscuring or not revealing the motivation for the new table.

What is it to see the shape of data and interpret it?

Is the hero the one who had the vision in the ideal plane? No, not if that vision wasn’t shared well enough. What if that table already existed and was named in accordance with standards? No, not if others didn’t see what they were looking for.

Data fits within a model and that model, when well executed, has a shape. Within that structure, there should be no missing pieces. It ought to resemble a completed jigsaw puzzle except in multiple dimensions.

My answer, now knowing the new business/technical requirement, was to add a series of fields:

  • Data Source (imported, native, etc)
  • Import date
  • Imported source key/ID

I put a few database column comments there like breadcrumbs. Maybe, possibly, when the other team sees these, they will populate these data points during import so we can provide an audit trail back into the legacy data.
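
A sketch of those breadcrumbs, continuing the illustrative table above. The column names and comment text are mine, not the project’s actual standards:

    ALTER TABLE service_order_reports ADD (
        data_source        VARCHAR2(20) DEFAULT 'NATIVE',  -- IMPORTED, NATIVE, etc.
        import_date        DATE,
        imported_source_id VARCHAR2(100)                   -- key/ID in the source system
    );

    COMMENT ON COLUMN service_order_reports.data_source
        IS 'Origin of this row: NATIVE or IMPORTED from the legacy system';
    COMMENT ON COLUMN service_order_reports.import_date
        IS 'Populate during import to allow an audit back into the legacy data';
    COMMENT ON COLUMN service_order_reports.imported_source_id
        IS 'Primary key of this row in the legacy system, if imported';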

These data structures provide a common framework for constructing an enterprise system. They are an abstract blueprint.

Video killed the Radio Star

During my career, I have witnessed so many changes and happily ridden the wave of innovation. I bought my first IBM PC the year they were released. First mobile phone: late 1980s. I worked as a field engineer at Cisco Systems during the massive expansion of the internet in the early 2000s. I had published four technical books about software development before my 30th birthday. I wrote my first database application in 1985. I designed and built my first enterprise software application with Oracle in 1996.

That’s me.

My grandfather, a nationally syndicated radio guy, once described television as a passing fad. His colleagues on the NBC Red Network (yeah, dials and tubes in a box) each signed contracts to join NBC television. These guys went on to be the black-and-white faces of the 1950s and 1960s. We celebrate these TV pioneers in movies and books.

My grandfather faded into obscurity with AM radio. My grandfather was a radio star. During WWII, he had unlimited gas rations to facilitate his role as a news guy reporting on a global war. He was dead before I was born, if you’re trying to do that math.

Daily, weekly, I wonder whether I am a pioneer or a radio star. My sistah-by-another-mistah bought a Compaq luggable the week it was released. We bought the first laptops, the first mobile phones. There is a scene of me sitting on a carpeted floor manually keying an IP address and mask into my computer before dialing into an IP-based bulletin board. In the late 1990s and early 2000s, I was using the internet protocol to exchange tsunami data between satellites and buoys floating in the mid-Pacific. My team built and deployed the first telemedicine system in Alaska during that same time. That earns me full marks as a pioneer.

Then I have these voices in my head shying away from newer tech. Am I suddenly the Radio Star?

Until that moment when I have to stop my pandemic-era truck and reboot it repeatedly to keep it going. I have to reboot my truck to get the nonsense on my dashboard working. Weekly, when I drive the thing, it tells me that it is doing a software patch… Oops, the internet connection dropped in the rural mountains where I live. “Please do not disconnect.” Right, the truck assumes that mobile signal is ubiquitous. It isn’t. I served as a rural paramedic for decades. 90% of my town of 40 square miles has no mobile phone signal. In fact, I think 70% of the two southern counties in my state have no or limited mobile coverage. I drive a truck that presumes that the internet is a viable, sustainable, always-available element of public infrastructure.

Why do I need my truck to reboot? Ever? I have a 20-year-old diesel farm tractor. It is happy with some tax-free diesel and a few hours of maintenance per year. I plug it in during cold winters so it starts with one or two turns of the ignition key.

I now live in a world where our own tech competes against itself. My truck desperately wants me to drive down the mountain and park in a lot for an hour so it can actually update its software. I don’t care to go down the hill and, frankly, I don’t care if my truck gets a software upgrade. That is, until I tried to back towards a trailer in a busy farmyard. It engaged auto-braking when tall grasses moved in the breeze. It freaked with alarms when I tried to back between two other trailers to get to the one I was connecting. Then did it ever freak when the trailer I plugged into did not meet the specifications within its existing digital definitions of trailers. Yes, my truck auto-braked and screamed with alarms when the weight of the trailer dramatically changed. I had to hack the system by disconnecting and reattaching the trailer’s wiring. Oh, right, I unloaded a tractor. My truck panicked and thought the load was unsafe.

My truck broke a sensor while driving through a farm field in 4WD to check on a friend’s hogs (not motorcycles but pre-bacon critters). It spent two, nearly three, weeks at the shop because that failure generated a cascade of electronic failures that cost someone a lot of money to fix (not me – warranty).

When did I become the anti-tech person? I am a tech pioneer. I am that soldier that says ‘follow me.’

There is a specific place for human, real, intelligence in our approach to and employment of technology. My friend and neighbor, whose husband was a Nobel laureate and Pulitzer winner, finally quit teaching creative writing at a prestigious New England university when 100% of the papers she had to grade were generated by AI. I commiserated with her with genuine but distant sympathy.

Then I started encountering AI-generated PL/SQL code in our application development. Geez, man, I get it: doing a rollup summary report without code references or a library can be challenging. But then it fails. In decades of teaching and leading teams of developers, suddenly, I am being told that this code is good because it was AI generated.

And yet, it fails. Sure, AI-generated PL/SQL code will improve. We publish basic rules, standards, within our guidelines. Hey, validate your parameters. Who knows what other procedures or pages will call that code? Evaluate the quality of the data before, during, and following processing. Data quality is a variable. And what Oracle programmer hasn’t had to learn the challenge of nulls within our data? Or had to manage the regional assumptions of NLS settings? Suddenly, you get errors about “mask terminated” when a date format doesn’t match the format of the date data presented.
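
A minimal sketch of the habits those guidelines ask for: validate the parameters, use an explicit format mask instead of trusting session NLS settings, and fail loudly when the data isn’t there. The procedure is hypothetical and reuses the illustrative invoices table from earlier:

    CREATE OR REPLACE PROCEDURE set_invoice_due_date (
        p_invoice_id IN NUMBER,
        p_due_on     IN VARCHAR2   -- arrives as text from an API or page item
    ) AS
        l_due_on DATE;
    BEGIN
        -- Validate parameters: who knows what other procedure or page calls this?
        IF p_invoice_id IS NULL OR p_due_on IS NULL THEN
            raise_application_error(-20001, 'Invoice id and due date are required');
        END IF;

        -- Explicit format mask: never let the session NLS_DATE_FORMAT decide
        -- how this text should be read.
        BEGIN
            l_due_on := TO_DATE(p_due_on, 'YYYY-MM-DD');
        EXCEPTION
            WHEN OTHERS THEN
                raise_application_error(-20002,
                    'Due date "' || p_due_on || '" is not in YYYY-MM-DD format');
        END;

        UPDATE invoices
           SET due_date   = l_due_on
         WHERE invoice_id = p_invoice_id;

        -- A silent no-op is a data quality failure too.
        IF SQL%ROWCOUNT = 0 THEN
            raise_application_error(-20003, 'Invoice ' || p_invoice_id || ' not found');
        END IF;
    END set_invoice_due_date;
    /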

In recent years, we have written and supported an enterprise application that also generates invoices. As of last month, we exceeded 100K invoices since going live 18 months ago. Our error rate is below 0.01%. Those errors are 100% due to poor quality data coming via an API from a legacy system. We are not permitted a math error. We cannot ball up the tax calculations. There are legal and regulatory impacts to mistakes at that level. My own firm has a commercial time/expense billing system that has significantly over 1M rows. And during our time supporting hurricane recovery in Puerto Rico, we used an Oracle database application to manage $5B in federal grant funding and over 400K PDF documents.

Know who is not allowed to make mistakes? Us.

Oracle makes these jobs a joy to work on. The power and the prestige of the Oracle brand and Oracle tech permit us to operate within these parameters.

A colleague offered great wisdom to me a decade ago when I called for help with a problem. He said, “Christina, there is no random code generator in your application. If your data is doing something, it is happening because you told it so.”

That’s right! I have the expertise to know, understand, and audit 100% of the data manipulations. I can be wrong in my thinking, my approach, or even my typing. Data we pull in from legacy systems can be wrong, incomplete, or just plain screwy. It is our job to fix or reject that data. But once in our system, the tools that manipulate our data were written by skilled software developers with years of training and skills.

There I am, positioning myself as a Luddite. Yet these are the words of a data pioneer, a tech pioneer.

The strength of an Oracle database is predicated on the legacy of database pioneering from the 1970s. The power, strength, and capabilities of PL/SQL sit in the very boring, very real, very stable world of generating invoices, purchase orders, and inventory systems, issuing airline tickets, tracking and balancing bank balances and credit card balances. I recognize the Radio Star in me when I question the role of AI in the mundane development of enterprise application systems that are used to manage public funds, public trusts, and public documents.

I try to recruit people into our world of Oracle database development and writing code in PL/SQL. It isn’t sexy. It isn’t glamorous. It is barely even in full color. It is a trade. We are the modern plumbers and modern carpenters. Our work is precision. Our work is elegant. Our work succeeds or fails based on our own skills to comprehend business data and model it within a digital framework. Then, when needed, we generate results that have an error rate as close to zero as possible.

Did I become my own grandfather, who died before I was born? Did I become the crusty old guy that says, “back in my day”? Am I the radio star?

There is a database-centric way of thinking that we teach and that young developers need to learn. We need to instantly look at data and evaluate its fitness, quality, and normalization. We need to write code that works beyond the narrow scope of perfect data, code that gracefully tolerates poor quality data, null data, and data presented in unexpected formats.

Like the Fortran and COBOL programmers before me, I know that Oracle PL/SQL is both contemporary and legacy at the same time. We are building massive systems that will live for decades and decades, pushing dollars, documents, and data at the behest of humans. We must build systems that survive every audit. We must preserve data that can take the beating placed on it by certified or chartered accountants.

A PL/SQL developer stepping into the market today will be employed through their entire career if they honestly develop the technical skills to write outstanding, but boring, code that manages and manipulates massive data sets. I celebrate this.

Pioneer? Or Radio Star? You decide.

A note about the title:

“Video killed the Radio Star” was the very first music video that played on MTV at 12:01AM on 01 AUG 1981. In February of 2000, when I was at Cisco Systems, it became the One Millionth video to have played on MTV. Today, there are those who might ask: What is MTV?

About my grandfather

He wasn’t a complete loser, guys! He published 20 novels and numerous films. Although most of his films are lost to history due to the nature of cellulose film. My father published over 20 novels as well. And in the autumn of 2024, my first novel is being published. That would be my 5th book.

The 1-2-3 of Data Normalization

Abstract

In the late 1960s and early 1970s, Dr. Edgar Codd, a British computer scientist, developed a means of reducing data duplication within relational databases. While six “forms” of the process exist, the first four ought to be at every database application developer’s fingertips. Fifty, nearly sixty, years have elapsed. Not only do these guidelines still apply today; when we forget them, we can cripple the flexibility and the dynamic ability of software to grow and adapt.

Number Zero

The Number Zero of the data normalization forms is called: Unnormalized. While I would love to joke about this structure of data, we all encounter it all too often. Therefore, let’s acknowledge that it exists.

Movin’ on!

Data Table (a requisite element)

To normalize data, one must be working within a relational database environment, such as Oracle. The table must have zero, one, or more rows/records. Each row/record must have a primary key. The primary key must be unique. Quick checklist (a minimal sketch follows the list):

  • Data stored in tables
  • Tables have 1 or more columns (fields)
  • Tables have 0, 1, or more rows (records)
  • Each row has a unique primary key that uniquely identifies one row
  • Each row of data must relate to 1 “item” (topic/element/etc)
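
Here is that checklist as a minimal table, sketched in Oracle SQL with hypothetical names:

    -- Each row describes exactly one client and carries a unique primary key.
    CREATE TABLE clients (
        client_id   NUMBER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        client_name VARCHAR2(200) NOT NULL
    );

    -- Zero, one, or more rows.
    INSERT INTO clients (client_name) VALUES ('Acme Anvil');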

My Short Hand

  1. Model the world as it is, not as you want it to be. The world is messy. It is ok to use complex data structures to model complex data. “Complicated things are complicated”
  2. Think first: then add tables when needed; add fields when needed
  3. Look for lateral duplication
  4. Look for vertical duplication
  5. Look for non-data data (messy/junky stuff)

Lateral Duplication

Do not add a field called Contact2, ever.

Do not add a field called email2, ever.

Do not add a field called Address2, ever.

If you have multiple contacts, add a table for the contacts. One row for each contact.

If you find that your tables have columns duplicating sideways (laterally), then create a new table! Hey, people have multiple email addresses. Create a table with a primary key and one email address per row. Validate the quality of the email to the best of the application’s ability. Each new email address is a new row. Done.

This is a casual but functional means of defining the First Normal Form.
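
As a sketch, building on the hypothetical clients table above, the fix for lateral duplication is a child table with one row per email address instead of email1/email2/email3 columns:

    CREATE TABLE client_emails (
        client_email_id NUMBER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        client_id       NUMBER NOT NULL REFERENCES clients (client_id),
        email_address   VARCHAR2(320) NOT NULL
    );

    -- Each new email address is a new row.
    INSERT INTO client_emails (client_id, email_address)
        VALUES (1, 'fred@flintstone.fake');
    INSERT INTO client_emails (client_id, email_address)
        VALUES (1, 'barney@flintstone.fake');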

Vertical Duplication

Look at your tables. Do you have important data that is duplicated by users going vertically in a table? You can see this if you have multiple clients called Acme Anvil. If you have five contact people named Wiley E Coyote, then you have an opportunity to improve the data model.

The solution tends to be new tables and breaking the data into smaller building blocks. When done well you work towards both the Second and Third Normal Forms.

Non-Data Data

When we ask users to type data free-form into fields, we tend to get junk. They enter “data” that is not-really-data. I am not talking about client names or personal names but all of those reason fields and note fields. That free-formed information. We mistype; we abbreviate; we get verbose; we get terse. In short, we get junk. Unless we use artificial intelligence or wicked complicated code, we lose the value of what the user is trying to tell us! Junk is junk!

If data matter to us, then we ought to use lookup tables and codes so that we capture what users need to track. It takes listening and tracking what they are doing. That is a great reason to talk with users and talk with your client. It is their data. We like good data. Good data has value. Bad data is bad.

First Normal Form (1NF)

To satisfy the First Normal Form of relational databases:

  • Each column (field) of a table must have a “single value”. This may be a primary key, a foreign key, a client name, etc. When we store compound data within a field, we’ve stepped off the path. Sorry, JSON.

How do we break this rule?

Let’s imagine a table with client data. The classic table has the unique primary key and the client name. Imagine needing an address. Let’s add an address column to the table. Into the column, we type the address: Spui 70 (boîte postale 12600), 2500 DJ La Haye, Nederland (that’s the Hague). Suddenly, I need another address for the same location. So I add in a delivery address for heavy trucks. If I put a second address into the ADDRESS field, then I have eliminated the usefulness of the ADDRESS field. And I have not followed the First Normal Form.

Another example with the client table: imagine a field for the email address of your contact person there. Store one address: compliant. Store two in that field and you are not compliant. Game over.

fred@flintstone.fake is fair

fred@flintstone.fake, barney@flintstone.fake in the same field breaks the rules.

Why is this bad?

Imagine that you need to email the client and you have two email addresses in the same field called EMAIL. If you send an email with one vendor, they want commas to separate the addresses. If you use the list with a different email vendor, they want a semi-colon between the two email addresses. You’ve lost control of the formatting, the validity, and the usefulness of the email addresses. It is useless data (unless you write code that parses it out).

Same with printing a mailing label for the Hague address above. You want precisely and exactly one address on a shipping label. If your ADDRESS field has two, then you are sunk. You lost the data game. No points for you.
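
One way to fix it, again with hypothetical names: an address table with one row per address and the parts split into their own columns, so a shipping label query returns exactly one address:

    CREATE TABLE client_addresses (
        client_address_id NUMBER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        client_id         NUMBER NOT NULL REFERENCES clients (client_id),
        address_type      VARCHAR2(20),    -- POSTAL, DELIVERY, ...
        street_line       VARCHAR2(200),
        postal_code       VARCHAR2(20),
        city              VARCHAR2(100),
        country           VARCHAR2(100)
    );

    -- The Hague example becomes two rows: the postal address and the
    -- heavy-truck delivery address. The shipping label asks for exactly one.
    SELECT street_line, postal_code, city, country
      FROM client_addresses
     WHERE client_id = 1
       AND address_type = 'DELIVERY';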

Second Normal Form (2NF)

To satisfy the Second Normal Form (2NF):

  • Each row must have a unique primary key.
  • Rows ought to describe unique data sets

Sure, yes, that is in the requisite steps listed above. Sometimes, we create data tables that require applications to always use multiple keys to find a desired row. In this situation, we use multiple or compound keys to access data.

Violating this rule happens all of the time. Can you envision the need to store a client in the client table that may have two divisions at two locations? Some people would create two separate clients, each with one address. But is this one client or two? We now have the name duplicated. In these rows, we have ACME ANVILS:

    Primary Key   Client Name   Address
    11235         Acme Anvil    123 Roadrunner Ave, WB
    2358          Acme Anvil    123 Wiley Ln, WB

Second Data Normalization Form

We do not know if this is one client with two addresses or two clients. We failed to model the complexity of the real-world data. We must solve problems with multiple divisions and multiple addresses. The solution tends to involve adding tables that have greater specificity.

We are creating redundancy in the data going vertically. Which Acme Anvil do we need? How do we identify the one we want?

Why is this bad?

You want to place your finger on the required data elements with ease. If you don’t know which is which, then how will the application?

How to Fix?

Typically, we need to add fields and tables to more completely and more robustly model the data as it is, not as we want it to be. Yes, the real world is messy. Clients have divisions, multiple offices, multiple addresses, multiple contact people. Roll the sleeves up, figure out the data needed and how to relate it.

Do not be afraid to view the data world as small data blocks that snap together like Lego bricks.

In this case, maybe we have a table for divisions. We create a table for addresses. Maybe we get super clever and add a client foreign key that references the primary key in its own table. This self-reference creates a tree structure for a hierarchical query. Super powerful for modelling data that nests within itself. With this technique, a client becomes a child of another client, which can represent a division within a client.
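
A sketch of that self-reference, again with hypothetical names. The clients table gains a nullable foreign key to itself, and a hierarchical query walks the tree:

    -- A client can be the child of another client (for example, a division).
    ALTER TABLE clients ADD (
        parent_client_id NUMBER REFERENCES clients (client_id)
    );

    -- Walk the tree from each top-level client down through its divisions.
    SELECT LPAD(' ', 2 * (LEVEL - 1)) || client_name AS client_tree
      FROM clients
     START WITH parent_client_id IS NULL
    CONNECT BY PRIOR client_id = parent_client_id;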

Third Normal Form (3NF)

Data typed by hand is:

  1. Flawed/error-prone
  2. Inconsistent

To satisfy the Third Normal Form (3NF):

  • Use Lookup Tables and lists to maintain consistency within the data

Often, when we ask users to classify or identify the qualities of data, we are asking them to type something. We had a situation today when a user knew that a client would have a tax ID, but did not yet have one. So the user typed: 0000.0000.0000 – which meant, “I put in a placeholder”. It is not a valid tax ID. The team failed to model this event in the lifecycle of client data. We use databases to organize and query data, often massive amounts of data. Consistency matters. Maybe there ought to be a field that says: Tax ID pending. Or maybe there is a lifecycle for the tax ID.

  • Client does not have a tax ID and does not need one
  • Client applied for a tax ID and it is pending
  • Client has a tax ID and it is typed in
  • The system validated the tax ID against an external authority.

Now we have four states of the tax ID in addition to the tax ID number. We get more information.

To satisfy the Third Normalization Form, we should create a table for Tax ID status:

  • No Tax ID Required (NONE)
  • Tax ID Pending (PENDING)
  • Tax ID entered (EXISTS)
  • Tax ID validated with external authority (VALID)

We then use a lookup table in conjunction with the Tax ID field to provide a more complete understanding of the data we are capturing. If a Tax ID is empty, we would have to guess. Is the Tax ID empty because the client does not need one? Or is it empty because the Tax ID is pending? If guessing, you’ll be right sometimes! 50-50, right?

When data become dependent on each other, we add tables and lookups to enforce consistency within the data set.

Had we asked users to type a reason for the absence of the Tax ID, we would see all sorts of junk! We would have turned an opportunity to improve the quality of data into a mess. If your users select from lists, then we are working harder to satisfy the Third Normal Form. My shorthand version of this sounds like this: use lookup tables to validate data.
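
A sketch of that lookup, with hypothetical names; the codes mirror the list above:

    CREATE TABLE tax_id_statuses (
        tax_id_status_code VARCHAR2(10) PRIMARY KEY,
        description        VARCHAR2(100) NOT NULL
    );

    INSERT INTO tax_id_statuses VALUES ('NONE',    'No Tax ID required');
    INSERT INTO tax_id_statuses VALUES ('PENDING', 'Tax ID applied for and pending');
    INSERT INTO tax_id_statuses VALUES ('EXISTS',  'Tax ID entered');
    INSERT INTO tax_id_statuses VALUES ('VALID',   'Tax ID validated with external authority');

    -- The client row points at the lookup instead of holding a free-typed reason.
    ALTER TABLE clients ADD (
        tax_id             VARCHAR2(30),
        tax_id_status_code VARCHAR2(10) DEFAULT 'NONE'
                           REFERENCES tax_id_statuses (tax_id_status_code)
    );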

Other Normal Forms

Data get complicated because, friends, our world is complicated. We can handle tables with multiple data dependencies. Normal Forms four, five, and six exist. And while I barely understand the terminology used in the academic literature and even online, following these forms leads us into gloriously interesting structures. Child tables that matrix to multiple parent tables. Once playing in these deeper waters, peers tend to sit and discuss how to work through these issues.

Conclusion

When making decisions, we always make trade-offs. We violate the normal forms both accidentally and on purpose. In my podcast, The Soul of an Internet Machine, I discuss data normalization. In the late 1960s, people were not writing software that ran globally, supporting 100 spoken languages across two dozen time zones circling the earth. We did not create a means of supporting data in multiple languages simultaneously. If we duplicate the data laterally for English, French, Dutch, German, and Italian, we violate normal forms. If we create new rows, we may also violate normal forms (it depends on the structure). If we store the phrases within JSON data inside of a field, we again violate the rules. So we plan and discuss.

When we add fields such as contact1, contact2, contact3, contact4, or even email1, email2, email3, email4, then we’ve created structural flaws in our applications from which we may never recover. Our applications have a limited useful life – until we bring them closer to the Normal Forms. But strict adherence may yield other problems. I get as close as I can to the Normal Forms, then I acknowledge… maybe not.

Selecting Font APEX from a Database Table

A quick means of grabbing Font APEX icon names from the CSS text file. For those of us who work on database applications from the server side and feel less sophisticated with client-side activities, I wanted to keep an inventory of the icons available in Oracle APEX within standard database tables. Why would a software developer include fonts in a column? Imagine scoring with a star-based rating system, or the simplicity of thumbs up/thumbs down.

SQL Development in Multiple Languages

SQL development must work on a global platform, but only if we get the settings correct to display languages correctly in our code. When an accent grave goes bad, it looks like an acute error (that’s a bad pun).