January 26, 2024
by Alexander Verbruggen

What is data

Data is the most important commodity of the 21st century. Every company revolves around the data it has, whether that is trade secrets behind some proprietary system, data about customers and their behavior, or data about markets and sales; the list goes on and on.

Software applications revolve around data. Storing data, transforming data, exchanging data, interacting with data, it is at the core of every digital solution in one way or another.

Noise vs data

At the most basic level, data differs from noise because it contains something of value.

When you are talking, you are transmitting data via airwaves. A car muffler uses the same medium but does not convey anything of value, so we label it noise. One man's noise can be another man's data, but that's a discussion for a different article.

Structured vs unstructured

Once we have established that data contains something of value, the question becomes: how can we extract that value? In the case of speech, our ears and brain transform the airwaves into electric signals that can lead to understanding and decision making.

However, from a computer perspective we would label this type of data as "unstructured". It is an analogue waveform with infinite precision. Even if we sample it into a digital signal with discrete precision, we would still consider it unstructured because it lacks syntax: there is no formal, documented way to parse this data into something the computer understands.

This is typical of data derived from analogue sources like video, images, audio,... or of things like natural language, which has evolved in ways that make rule-based analysis nearly impossible because any given language has more exceptions to its rules than actual rules.

Suppose you want to get all the contract details for a particular contract number; the unstructured way to ask a computer could be:

Hey uh...computer...can you give me the latest info on contract A37..no wait, A389001?

Did you read that as "A-3-8-9-0-0-1" or did you arbitrarily group numbers like "A-38-90-01" or even "A-389-0-0-1"?

A structured way to ask the same thing could be using one of the many well documented formats like XML:

<contractNumber>A389001</contractNumber>

The angle brackets and the slash in the closing tag are all part of the syntax, making it trivial for a computer to extract the two key aspects of this message: the field name "contractNumber" and the actual value of that field: A389001.
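
As a quick illustration, here is a minimal sketch (in Python, purely to show how little work is involved) of pulling both pieces of information out of that structured message:

# a minimal sketch: parsing the structured message with Python's standard library
import xml.etree.ElementTree as ET

element = ET.fromstring("<contractNumber>A389001</contractNumber>")
print(element.tag)   # contractNumber
print(element.text)  # A389001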

Of course, what is easy for a computer to parse and understand is not necessarily easy for a human to create with the precision required by a computer. That is at the core of application development: building ways for computers and humans to interact.

Artificial Intelligence

At this point it is important to highlight the significance of AI, especially in the last few years. Classic AI is, much like our brains, incredibly good at pattern recognition. While in single instances it generally does not perform as well as our brain, it more than makes up for it in volume and complexity. You can categorize a single image better than a computer, but when it comes to categorizing millions of images, the computer will always win.

In the same vein, when figuring out the pattern in a limited set of numbers, our brains will usually win, but given a big enough data set our brains can no longer cope while the computer actually thrives on bigger datasets.

This pattern recognition power can be used to extract the value from seemingly unstructured data. For example, given enough language samples, the computer can figure out the patterns of a language, allowing it to deduce which words are significant and how. Given enough examples of cat pictures, it can quickly find cats in millions of images.
Even if you have a large amount of already structured data, AI can be used to extrapolate new insights you were not aware of by noticing patterns hidden in the dataset.

However, this process is so computationally intensive that we have switched to dedicated hardware rather than doing it on general-purpose computers. Even with all that processing power, it can produce incorrect results, which means you should only go this route if there is no other way to extract the value from the data.

Even if you could add the necessary error checks to ensure that the end result is always correct, the amount of power needed to get to the result is vast in comparison to structured data.

Data is only valuable once we get the value out; structured data simply makes it a lot easier to extract that value and use it in other processes. That means AI analysis of unstructured data is mostly valuable as a way to arrive at some form of structured data that then feeds into other systems. It often sits at the periphery of structured data systems, allowing them to work with data that would otherwise be worthless to them.

Two dimensions

When we talk about structured data, there are actually two dimensions to that structure.
The one we highlighted above is the data format: the lexical conformity the data must adhere to so other systems can easily parse and understand it. It defines the separators or patterns a computer can look for to determine where one piece of data stops and another begins. Examples of this are JSON, CSV, XML, EDI,...

The other dimension to structured data is defining what it is actually trying to convey. While both these examples use the same format (XML), they talk about entirely different topics:

<book>
    <isbn>123</isbn>
    <author>John Smith</author>
</book>

<company>
    <vat>BE123</vat>
    <name>Example Company</name>
</company>

While the data format is relevant to parsing the data into something the computer understands, the data model details what it actually means and is relevant to using the data in business logic. While you can do some neat technical tricks on data whose model is unknown, there is very little useful business logic that can be built on data with an unknown model, because at some point you need to know how the data you have relates to the problem you are trying to solve.

Data Format

There are tons of standardized data formats and an infinite amount of possible custom formats. Even though each is a unique solution for a particular problem, there are some broad categorizations that can be made.

Readability

The primary distinction is often between binary and textual formats. A textual format is one where data is written in a way that can be easily parsed by computers but still read by humans. A binary format, however, is entirely aimed at computers and cannot be read by humans without additional tooling.

Binary formats are useful for three things:

  • storing unstructured data like images, video,...
  • they can be more space-efficient, which makes them interesting in scenarios where storage space or bandwidth is at a premium
  • they obfuscate the data that they contain, allowing for proprietary formats where people might have the data physically on their computer but can't read it without a specific tool

Most data that humans care about is textual in nature: we care about numbers, names, taxonomy, relations,... This means that textual formats are usually a natural fit.

While Nabu has support for binary formats, in our day-to-day we mostly deal with textual formats which is why I'll focus on those.

Dimensionality

A second major distinction between data formats is how many dimensions they can express.

Take an Excel file, for example: it is extremely good at two-dimensional data, rows and columns, but it does not allow for anything more complex.

Suppose we want to express the following data:

  • a list of dog owners
  • each owner has a list of dogs
  • each dog has a list of his medical checkups

In an Excel file you would need three sheets with cross-references between them.

Where Excel is a binary format, CSV is a textual format that is interchangeable with Excel files. It does not have the concept of sheets though, which means you need three different CSV files:

# id, firstName, lastName
1, John, Smith

# id, dogName, ownerId
1, Sparkie, 1
2, Jack, 1

# id, checkup, dogId
1, 6 month vaccination, 1
2, yearly checkup, 2

Alternatively you could squash the data into a flat list with repetition of the shared parts:

# firstName, lastName, dogName, checkup
John, Smith, Sparkie, 6 month vaccination
John, Smith, Jack, yearly checkup

Neither is all that elegant. CSV does have the advantage of being one of the most space-efficient textual formats, which is why it is often used to express larger datasets.

In a hierarchical format like JSON we can express the data in a way that better matches our original intent:

[{
    "firstName": "John",
    "lastName": "Smith",
    "dogs": [{
        "name": "Sparkie",
        "checkups": [
            "6 month vaccination"
        ]
    }, {
        "name": "Jack",
        "checkups": [
            "yearly checkup"
        ]
    }]
}]

Owners that have dogs, which in turn have checkups. The "has" relationship between the objects is of course important here.
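
To show how the flat and the hierarchical representations relate, here is a rough sketch (in Python, with illustrative variable names) that regroups the flattened rows from the CSV example back into the nested owner, dog and checkup structure:

# a rough sketch: regrouping the flattened rows into the nested structure
import json

flat_rows = [
    {"firstName": "John", "lastName": "Smith", "dogName": "Sparkie", "checkup": "6 month vaccination"},
    {"firstName": "John", "lastName": "Smith", "dogName": "Jack", "checkup": "yearly checkup"},
]

owners = {}
for row in flat_rows:
    owner = owners.setdefault((row["firstName"], row["lastName"]), {
        "firstName": row["firstName"],
        "lastName": row["lastName"],
        "dogs": {}
    })
    dog = owner["dogs"].setdefault(row["dogName"], {"name": row["dogName"], "checkups": []})
    dog["checkups"].append(row["checkup"])

# turn the intermediate dictionaries back into plain lists
nested = [{**owner, "dogs": list(owner["dogs"].values())} for owner in owners.values()]
print(json.dumps(nested, indent=4))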

Compactness

Early data formats had a much less mature infrastructure to deal with. The internet was still in its dial-up phase while disk sizes were measured in megabytes rather than terabytes. In those days every bit mattered. These days byte efficiency is still important, but, unless you are working in a very constrained environment like IoT, it is no longer the most important metric.

As mentioned before, binary formats can be more efficient than textual formats, but a really lean textual format doesn't have that much more overhead.

A broad category of efficient textual formats is sometimes referred to as "flat files". These rely on a combination of separators and, potentially, positional rules to structure the data.

CSV can be seen as a sort of flat file:

1,John,Smith
2,Jane,Doe

CSV is a separator-based flat file: we use linefeeds as the record separator which identifies where one object stops and the next one begins.
We use commas as the field separator which indicates within a single record where one field ends and the next begins.

This data format assumes that you have agreed on which fields exist in an object and in which order they will be defined. This is not enforced by the format.
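
As a small sketch of what that agreement looks like in practice (Python, with the field names simply assumed out of band):

# a small sketch: the separators live in the data, the field names do not
data = "1,John,Smith\n2,Jane,Doe"

# agreed out of band, not enforced by the format itself
agreed_fields = ["id", "firstName", "lastName"]

records = [
    dict(zip(agreed_fields, record.split(",")))
    for record in data.split("\n")
]
print(records)  # [{'id': '1', 'firstName': 'John', 'lastName': 'Smith'}, {'id': '2', ...}]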

Note that over the years, the name CSV has been used for a lot of formats that look vaguely similar, meaning a CSV parser these days has to contend with a lot of different dialects.

In a positional flat file, you could do this:

001    John   Smith002    Jane     Doe

We agree that a record is always 19 characters long and then:

  • the first three characters are the id, padded with leading zeroes if it is not a three-digit number
  • the next 8 characters are the first name, padded with whitespace as needed to reach that length
  • the final 8 characters are the last name, again padded with whitespace

Fixed length has the distinct advantage that you can instantly jump to a specific place. You want the last name of record 100583? You can just calculate the position and jump to it, rather than having to parse all the intervening records to know where you need to be.

The tradeoff is that it is less space efficient and potentially highly restrictive.
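
The jump itself is just arithmetic. A sketch, assuming the 19-character layout above, zero-based record numbering and a hypothetical file name:

# a sketch of random access in a fixed-width file
RECORD_LENGTH = 19   # 3 for the id, 8 for the first name, 8 for the last name
record_index = 100583

# skip straight to the last name of that record
offset = record_index * RECORD_LENGTH + 3 + 8

with open("owners.dat", "rb") as flat_file:  # hypothetical file name
    flat_file.seek(offset)
    last_name = flat_file.read(8).decode("ascii").strip()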

One of the problems with flat files is that you cannot parse them without a separate definition. Just receiving the data is not enough to turn it into something that can be used.

As optimal byte size became less important, enterprise applications slowly shifted to XML, which enforces a particular structure on the format and is more readable to boot. An XML file stands on its own and can be understood by a computer without any further definitions:

<owners xmlns="www.example.com">
    <owner id="1">
        <firstName>John</firstName>
        <lastName>Smith</lastName>
    </owner>
    <owner id="2">
        <firstName>Jane</firstName>
        <lastName>Doe</lastName>
    </owner>
</owners>

This capability comes at a cost though: we have gone from the original 23 characters in the CSV to roughly 210 in XML.

XML became especially popular due to SOAP webservices. In modern applications, SOAP has been largely supplanted by REST, which can be used with XML but is more often used with JSON. This is in no small part due to the rising popularity of JavaScript.

Binary content

One area where textual formats generally fall flat is when trying to send binary data. Suppose in our dog owner example we wanted to include a picture of our owner in the dataset itself.

Textual formats work partly by using special characters that are typically not part of standard human text. Where textual formats count on you being human and using only a limited range of byte combinations to express yourself, binary data has no such qualms. Any and all byte combinations will occur in a binary dataset, which means it will inevitably overlap with those special characters, confusing any attempt to parse it.

Instead, embedding binary data within a textual format requires encoding it. By far the most common encoding is base64, which translates arbitrary binary data into the more limited range of human-readable symbols. This does however make the data roughly 33% larger, which is a rather big overhead on data that might already be big.
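
You can see where that overhead comes from: base64 maps every 3 bytes of input onto 4 output characters. A small sketch:

# a small sketch of the base64 overhead: 3 bytes in, 4 characters out
import base64
import os

binary_data = os.urandom(3_000)   # stand-in for an image or other binary blob
encoded = base64.b64encode(binary_data)

print(len(binary_data))  # 3000
print(len(encoded))      # 4000, roughly 33% larger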

For this reason binary content is often transmitted separately from textual data and simply referenced.

Large data

No matter how compact your data format is, people will generate massive files with it.

Suppose we take that owners example, but instead of two owners we have 20 million. Roughly extrapolated from the example, the resulting XML file would be around 1.6GB.

If you were to naïvely ask a library to parse that XML file, it would explode your memory usage, possibly crashing your application. What if it's 16GB? 100GB?

Dealing with big files in any format is always an additional challenge. The good thing is, big files are generally only big because of one or more repeating elements. It is not one massive blob of information but a lot of repetitions of smaller bits of information.

Nabu comes with "windowed" parsing where you can configure one (or more) locations where large amounts of repeating elements are expected. You can then set a batch size and a window size determining how many items it keeps in memory at any given time and how many it parses in one go.

While you can technically jump back and forth even in big files using this approach, it is obviously costly to re-parse items that are no longer in memory when you jump back.

When dealing with large files, the best thing you can do is set up windowing, loop through the data in its original sequence and split the work into smaller jobs that each deal with a portion of the file. Instead of processing 20 million owners sequentially in a single thread, you can process them in batches of, say, 10,000 and even do that in parallel to speed up the overall process.
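
The underlying idea can be sketched with a generic streaming parser: only the current batch of repeating elements is kept in memory, while the rest of the file has either already been processed or not yet been read. A rough illustration in Python (this is not Nabu's actual API, and the file name is hypothetical):

# a rough illustration of the idea behind windowed parsing (not Nabu's actual API)
import xml.etree.ElementTree as ET

NS = "{www.example.com}"
BATCH_SIZE = 10_000

def process(batch):
    ...  # hand the batch off to a smaller job, possibly running in parallel

batch = []
for event, element in ET.iterparse("owners.xml", events=("end",)):
    if element.tag == NS + "owner":
        batch.append(element.findtext(NS + "firstName"))
        element.clear()  # free the memory used by this element
        if len(batch) >= BATCH_SIZE:
            process(batch)
            batch = []

if batch:
    process(batch)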

Nabu supports windowed parsing for a number of formats: XML, CSV, flat files,...

Data model

As we saw, early data formats needed a separate definition to be able to parse them at all. Newer data formats carry all the necessary metadata so a computer can parse them automatically without any further information.

So what do we need a data model for?

While a computer might be able to automatically parse our owners XML file, it has no clue which fields should be there or what they mean.

It will happily parse this:

<owners xmlns="www.example.com">
    <owner id="1">
        <firstName>John</firstName>
        <lastName>Smith</lastName>
    </owner>
    <owner id="not-a-number">
        <foirstNam>Jane</foirstNam>
        <dog>Doe</dog>
    </owner>
</owners>

While technically valid XML and fully parseable, it is...not very useful. The data model allows us to tell the computer which fields should be there and what they should look like. At the same time it is used to inform people of which data they should expect.

Tooling

There are many tools and standards to define data models, for example XML Schema, JSON Schema, UML... While their naming might imply an explicit link to a particular data format, it is not that clear cut.

The actual model you create in XML Schema is persisted as XML and it is generally used to model XML files but that is not really an inherent limitation. You could create a model in XML Schema and apply it to a JSON file for instance.

Let's have a look at what a model of our dog owners could look like in XML Schema:

<schema xmlns="http://www.w3.org/2001/XMLSchema" xmlns:tns="example.com" targetNamespace="example.com">
    <complexType name="Owner">
        <sequence>
            <element name="firstName" nillable="true" type="string"/>
            <element name="lastName" nillable="true" type="string"/>
        </sequence>
    </complexType>
    <element name="owner" nillable="true" type="tns:Owner"/>
</schema>

In JSON Schema we might get:

{
    "type": "object",
    "required": [
        "firstName",
        "lastName"
    ],
    "properties": {
        "firstName": {
            "type": "string"
        },
        "lastName": {
            "type": "string"
        }
    }
}
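
To see such a schema do its work, here is a small sketch that validates two owners against it, using the third-party Python jsonschema package (any schema-aware tool would do the same):

# a small sketch: validating instances against the JSON Schema above
# (uses the third-party "jsonschema" package)
from jsonschema import validate, ValidationError

owner_schema = {
    "type": "object",
    "required": ["firstName", "lastName"],
    "properties": {
        "firstName": {"type": "string"},
        "lastName": {"type": "string"}
    }
}

validate({"firstName": "John", "lastName": "Smith"}, owner_schema)  # passes silently

try:
    validate({"foirstNam": "Jane", "dog": "Doe"}, owner_schema)     # the garbled owner from earlier
except ValidationError as error:
    print(error.message)  # 'firstName' is a required property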

These data modeling notations are described using a computer-readable data format and follow their own data model rules.

You don't really need to worry too much about the exact notation of these schemas as you'll rarely need to edit them by hand. There are tons of tools available that will give you a graphical environment to model your data, abstracting away the notational details.

Nabu comes with an abstraction layer over specific data modeling languages and maps existing models to that abstraction layer. This means Nabu has support for a number of different modeling formats:

  • XML Schema
  • JSON Schema
  • UML
  • Java beans
  • ...

But in general we will use our own internal modeling tools when creating new models from scratch to allow for maximum use of Nabu-specific features.

Two Types

In data modelling we generally distinguish between two different data types:

  • simple type: an atomic piece of information like a number, a date, a piece of text,...
  • complex type: a composition of simple types and/or other complex types

In our owner example, the firstName and lastName fields would be simple types while the owner wrapper element is a complex type.

Because a string is an arbitrary sequence of characters, almost all data we use in our day-to-day lives is at the very least a string.

However, some of these strings have further restrictions imposed on them because they are conveying something more specific than free text, for example a date or a number.

In human writing we already agree on specific formats to convey such data, although geographic differences can still lead to misunderstandings: in the US most people write month/day/year while in Europe we write day/month/year. To avoid exactly this kind of confusion, such things are standardized in data models; XML Schema's date type, for example, uses the ISO 8601 form 2024-01-26.

Specificity

By adding properties to our elements, we can specify exactly what we want: for example, a number instead of a string. But we can be even more specific. If we are capturing an age, we might want to make sure a negative number is never allowed.

If we want to limit the first name to a specific set of characters, we could configure a regex pattern to ensure that no special characters are used. We could also limit it to a maximum of 128 characters and require at least 1 character.
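
In JSON Schema, for example, those restrictions could look something like this (the exact pattern and limits are just illustrative choices):

{
    "type": "object",
    "properties": {
        "firstName": {
            "type": "string",
            "minLength": 1,
            "maxLength": 128,
            "pattern": "^[A-Za-z '-]+$"
        },
        "age": {
            "type": "integer",
            "minimum": 0
        }
    }
}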

As a general rule: the more specific you can be, the better. You can always loosen restrictions later on if you need to support other use cases. Conversely, once you have agreed upon a specific data model, it is very hard to make it stricter after the fact, because that would require all parties involved to adhere to the stricter rules.

For example: if a field is mandatory at the outset, all third parties sending you data will fill it in. If you make it optional later on, the data being sent by those third parties is still valid; this is what we call a backwards compatible change to a data model. However, if that field was optional at the outset, some third parties might not send it along in some use cases. Suddenly making it required means that data will no longer be valid; this is a non-backwards compatible change.

When evolving data models, we strive to make only backwards compatible changes. In general this means loosening restrictions on existing fields or adding optional fields.

You can easily validate if it is backwards compatible by asking yourself: is all the data that is generated against the old model valid against the new model?

Limits of specificity

Most validation is performed on a single simple type. The pattern you set on a string only applies to that field, not its sibling field.

Sometimes you want to create more complex validations, for instance: if field A is filled in then field B is mandatory. These cross field validations are typically harder to define in a standard data model.

There are specifications like Schematron that try to capture rule-based validation spanning multiple fields, but they are rarely used.

More complex validations are often implemented in the "business logic". It's not always clear where validation rules end and business logic begins.
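
A cross-field rule like the one above then ends up as a few lines of ordinary code rather than schema. A sketch, with made-up field names:

# a sketch of a cross-field validation implemented as business logic
# (the field names are made up for illustration)
def validate_record(record: dict) -> list[str]:
    errors = []
    # if fieldA is filled in, fieldB becomes mandatory
    if record.get("fieldA") and not record.get("fieldB"):
        errors.append("fieldB is required when fieldA is filled in")
    return errors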

Take a numeric value. It could be easy to simply state that it should never be below zero. But what if the numeric value is a dollar amount for a payment and we actually want to check that the number is never larger than your current bank balance sitting in a database record? What if some customers are allowed to go into the red on their account, but only up to a specific amount? Some payments might need to be flagged for manual inspection, some might require additional verification from an account holder,...

Two general guidelines that can help you decide if it belongs in validation or business logic:

  • When the actual validation rule is dynamic or the consequences of failing a validation are dynamically decided, it falls squarely in the domain of business logic.
  • Validation rules should say something about the absolute nature of the value rather than how it will be used.

As an example of that second guideline: you could state, with quite a bit of certainty, that a value expressed in kelvin should never be below 0. That's an intrinsic limitation of the frame of reference you are using. If, however, the field only contains "degrees" and another field indicates a variable unit (e.g. Celsius, Fahrenheit,...), the reference frame is dynamic and it becomes a lot harder to set viable limits on the degrees field.

Purpose

Data modeling can be used for a number of purposes but I will focus on two major ones:

  • modeling your APIs
  • modeling your database

There is a big difference between these two.

APIs

APIs focus on offering a particular view on a given dataset. Each endpoint has a specific purpose, tailored to the needs of its consumers.

We could for instance offer an API endpoint where we can fetch the dog owners with the necessary details:

[{
    "firstName": "John",
    "lastName": "Smith",
    "dogs": [{
        "name": "Sparkie",
        "checkups": [
            "6 month vaccination"
        ]
    }, {
        "name": "Jack",
        "checkups": [
            "yearly checkup"
        ]
    }]
}]

We might offer a second API endpoint where we fetch doctors with all the dogs they've treated:

[{
    "firstName": "Jane",
    "lastName": "Doe",
    "type": "docter",
    "dogs": [{
        "name": "Sparkie",
        "checkups": [
            "6 month vaccination"
        ]
    }]
}]

Each API has a particular view on a shared dataset and makes some assumptions as to why you are requesting this particular data.

Database

APIs build upon a persisted dataset that lives somewhere. Where an API is focused on a particular view of the data, the persistence layer cannot afford to make such assumptions: it must offer maximum flexibility to combine the data into as many different views as possible.

In our above example, both the owner and doctor API returned a shared object: the dog. From an API perspective this is perfectly acceptable, but how would you store this information in your database?

Simply storing the JSONs as outlined above would end up duplicating the dog, with no easy way to know that they are actually referring to the same dog.

You could decide to make one of them the actual owner while the other has a reference:

[{
    "id": 1,
    "firstName": "John",
    "lastName": "Smith",
    "type": "owner",
    "dogs": [{
        "id": 1,
        "name": "Sparkie",
        "checkups": [{
            "id": 1,
            "checkup": "6 month vaccination"
        }]
    }, {
        "id": 2,
        "name": "Jack",
        "checkups": [{
            "id": 2,
            "checkup": "yearly checkup"
        }]
    }]
}, {
    "firstName": "Jane",
    "lastName": "Doe",
    "type": "docter",
    "dogs": [{
        "ownerId": 1,
        "dogId": 1,
        "checkupId": 1
    }]
}]

But this becomes a fragile mess very quickly.

Instead, at the database level we normalize the data, which means we flatten it. In our example we have four objects:

  • owners
  • dogs
  • checkups
  • doctors

Instead of trying to define some hierarchical relation between them and store them as such, we give each of them a flat, two-dimensional table.

This approach allows us to add links in all directions. Doctors can be linked to checkups. Checkups can be linked to dogs. Dogs can be linked to owners.
We can now approach this data from any direction and get the answer we want.
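
A minimal sketch of that normalized layout, using SQLite purely as an example (table and column names are illustrative):

# a minimal sketch of the normalized, flat tables (SQLite purely as an example)
import sqlite3

connection = sqlite3.connect(":memory:")
connection.executescript("""
    CREATE TABLE owners   (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT);
    CREATE TABLE doctors  (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT);
    CREATE TABLE dogs     (id INTEGER PRIMARY KEY, name TEXT,
                           owner_id INTEGER REFERENCES owners(id));
    CREATE TABLE checkups (id INTEGER PRIMARY KEY, description TEXT,
                           dog_id INTEGER REFERENCES dogs(id),
                           doctor_id INTEGER REFERENCES doctors(id));
""")

# any direction works: for example, all dogs a given doctor has treated
rows = connection.execute("""
    SELECT DISTINCT dogs.name
    FROM checkups
    JOIN dogs ON dogs.id = checkups.dog_id
    WHERE checkups.doctor_id = ?
""", (1,)).fetchall()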

In the future as we add new concepts to our model, we can add additional links without disturbing the data that exists.