Tags
October 17, 2023
by
Alexander VerbruggenRead more about this author

Diffing

The goal of diffing is to determine the changes that occurred between two states of a particular data set.

The best known example is textual diffing where we determine which lines were added, removed and updated. For example if we have these two texts:

This is the first line
This is the second line
This is the first line
This is the third line

We can ask a line diffing tool to generate a standardized diff:

2c2
< This is the second line
---
> This is the third line

A diff is mostly useful in combination with a patch tool that can actually apply the changes on top of a dataset to go from one version to another.

Especially when dealing with large files, it can be a lot easier to keep track of changes rather than full copies of each version. You can even apply changesets in different combinations that may lead to a document version that never actually existed before then. This allows you to for instance merge multiple branches that were concurrently created by different people.

The algorithm used for determining the diff depends on the data set in question. For text you could use the longest common subsequence algorithm for instance.

If you have structured data however, especially with hierarchical data and arrays, line diffing of a serialized version might get you some answers, but better results can be obtained if you use algorithms that are aware of the data structure.

The order of things

When diffing complex objects, one particular challenge is the comparison of arrays.

In a naive approach, we could perform indexed based diffing: assume that index 0 in the first object must be compared to index 0 of the second object.

However, if your change happens to be delete index 0 of the second object, you might be comparing against the next item in the list which will likely register as a massive change. This of course propagates throughout the entire array.

Ideally you want a way to identify each record in an array so you can find the corresponding match and diff those. In this particular algorithm I assume that you indicate some field as being the primary key of an array. This will be used to find the corresponding match.

Expressing change

When you are calculating the diff, you can wonder whether the second change should take into account the first change.

For instance if our first change concludes that you have deleted index 0 of an array, then we proceed with diffing index 1, should any changes in that be reported as index 1 (which is compared to the as is without any applied change), or already take into account the delete and report as index 0?

The same applies for when adding things to a substantially changed array. Suppose you perform some deletes and inserts, what should be the reported index of those inserts? The one in the target array? The source array length minus any deletions? Or assume all insertions after the source array latest index without any deletions?
You could opt to not report an index at all and just append the data.

These sorts of design decisions can impact readability of the result set but are especially relevant for the patching tool because whatever assumption you make in reporting, it has to apply correctly when attempting to apply the full changeset.

Partial diffs

Sometimes you want to diff two objects of a different type, a common usecase is that the second object is actually a restricted view of the first one.

If we were to simply check for all differences, all the restricted data would be flagged as deleted.

Conversely any additional fields in the second document would be flagged as new data but it could not be successfully merged into the first document because it has no knowledge of these fields.

Instead we want to limit our diffing to fields that are available in both documents as these are the only fields that we can have any useful conclusion for.

The diff script

This glue script allows you to pass in two complex instances and it will report the differences in a structured resultset that looks like this:

Diff result structure

The script looks like this:

original ?= null
new ?= null

originalTypeId = nabu.utils.reflection.Type.of(original)/typeId

# Get all the properties, we need to know the unique keys at each level to perform proper diffing in case of lists etc
properties = nabu.utils.reflection.Type.describe(originalTypeId, true)/parameters

newTypeId = nabu.utils.reflection.Type.of(new)/typeId

# If they are not of the same type, we only compare the fields that they have in common
if (newTypeId != originalTypeId)
    newProperties = nabu.utils.reflection.Type.describe(newTypeId, true)/parameters
    properties = properties[path ? /newProperties/path]

differ = lambda
    original ?= null
    new ?= null
    # the path so far
    path ?= null
    # the path without any array indexes etc
    flatPath ?= null

    localProperties = when(flatPath == null, properties[path ~ "^[^/]+$"], properties[path ~ /flatPath + "/[^/]+$"])
    
    @return
    diffs = series()
    
    # we loop over all properties so we check all values (not only values that happen to exist in the original)
    for (property : localProperties)
        childPath = when(path == null, "", path + "/") + property/name
        childFlatPath = when(flatPath == null, "", flatPath + "/") + property/name
        
        oldValue = original[/property/name]
        newValue = new[/property/name]
        
        # if it's a list, we can have inserts and updates etc
        if (property/list)
            listPrimary = first(properties[path ~ /childFlatPath + "/[^/]+$" && primary == true])
            if (!listPrimary)
                throw("Could not find primary key in loop " + property/path)

            # keep track of the ids we already checked to conclude inserts
            ids = series()

            for (single : oldValue)
                iterationPath = childPath + "[" + $index + "]"
                primaryKey = single[/listPrimary/name]
                ids = merge(ids, primaryKey)
                # we need to find the equivalent entry in the new value
                equivalent = first(newValue[$this[/listPrimary/name] == /primaryKey])
                # it was deleted!
                if (equivalent == null)
                    diffs = merge(diffs, structure(path: iterationPath, action: "delete", oldValue: single))
                else
                    diffs = merge(diffs, differ(single, equivalent, iterationPath, childFlatPath))

            # the insert counter starts after the last element in the old value
            # this does not take into account deletes however, so if you apply both inserts and deletes the actual index might change
            insertCounter = size(oldValue)                    
            for (single : newValue)
                primaryKey = single[/listPrimary/name]
                # if it's not in the ids, its in insert
                if (primaryKey !? ids)
                    diffs = merge(diffs, structure(path: childPath + "[" + insertCounter + "]", action: "insert", newValue: single))
                    insertCounter = insertCounter + 1
                
        else if (property/simple && oldValue != newValue)
            diffs = merge(diffs, structure(path: childPath, action: "update", oldValue: oldValue, newValue: newValue))
            
        else if (!property/simple)
            # new value
            if (oldValue == null && newValue != null)
                diffs = merge(diffs, structure(path: childPath, action: "update", oldValue: oldValue, newValue: newValue))
            # removed value
            else if (oldValue != null && newValue == null)
                diffs = merge(diffs, structure(path: childPath, action: "update", oldValue: oldValue, newValue: newValue))
            # dig deep
            else if (oldValue != null && newValue != null)
                diffs = merge(diffs, differ(oldValue, newValue, childPath, childFlatPath))

@return
nabu.frameworks.cdm.utils.diffResult [] diffs = differ(original, new)

Example

Assume the following test data structure. Note that the contact array uses email as its primary key:

Company Structure

We have these two instances that we want to diff:

{
    "name": "Company1",
    "address": {
        "street": "testStreet1",
        "number": "35",
        "city": null,
        "postCode": null
    },
    "contacts": [
        {
            "email": "user4@example.com",
            "firstName": "John",
            "lastName": "Smith Again"
        },
        {
            "email": "user1@example.com",
            "firstName": "John",
            "lastName": "Smith"
        }
    ]
}
{
    "name": "Company2",
    "address": {
        "street": "testStreet2",
        "number": "35",
        "city": null,
        "postCode": "2000"
    },
    "contacts": [
        {
            "email": "user3@example.com",
            "firstName": "John",
            "lastName": "Not Smith"
        },
        {
            "email": "user2@example.com",
            "firstName": "John",
            "lastName": "Other Smith"
        },
        {
            "email": "user1@example.com",
            "firstName": "John",
            "lastName": "Smooth"
        }
    ]
}

The resultset that the algorithm gives us:

[
    {
        "path": "name",
        "action": "update",
        "oldValue": "Company1",
        "newValue": "Company2"
    },
    {
        "path": "address\/street",
        "action": "update",
        "oldValue": "testStreet1",
        "newValue": "testStreet2"
    },
    {
        "path": "address\/postCode",
        "action": "update",
        "newValue": "2000"
    },
    {
        "path": "contacts[0]",
        "action": "delete",
        "oldValue": {
            "email": "user4@example.com",
            "firstName": "John",
            "lastName": "Smith Again"
        }
    },
    {
        "path": "contacts[1]\/lastName",
        "action": "update",
        "oldValue": "Smith",
        "newValue": "Smooth"
    },
    {
        "path": "contacts[2]",
        "action": "insert",
        "newValue": {
            "email": "user3@example.com",
            "firstName": "John",
            "lastName": "Not Smith"
        }
    },
    {
        "path": "contacts[3]",
        "action": "insert",
        "newValue": {
            "email": "user2@example.com",
            "firstName": "John",
            "lastName": "Other Smith"
        }
    }
]

Partial diff example

Let's now run a diff where we start from the same company as our source object but we diff against an enriched restricted derivative with the following definition:

Enriched restricted derived company

We use the following content:

{
    "name": "Company1",
    "address": {
        "street": "testStreet1",
        "number": "35",
        "city": null,
        "postCode": null
    },
    "contacts": [
        {
            "email": "user4@example.com",
            "firstName": "John",
            "lastName": "Smith Again"
        },
        {
            "email": "user1@example.com",
            "firstName": "John",
            "lastName": "Smith"
        }
    ]
}
{
    "name": "Company2",
    "address": {
        "street": "testStreet2",
        "number": "35",
        "city": null,
        "postCode": "2000"
    },
    "contactEmail": "contact@example.com"
}

Because of the difference in definition we don't actually compare the contacts from the source object nor the contactEmail that only exists in the target object. The diff now becomes:

[
    {
        "path": "name",
        "action": "update",
        "oldValue": "Company1",
        "newValue": "Company2"
    },
    {
        "path": "address\/street",
        "action": "update",
        "oldValue": "testStreet1",
        "newValue": "testStreet2"
    },
    {
        "path": "address\/postCode",
        "action": "update",
        "newValue": "2000"
    }
]

Patching

As mentioned before, a diff is only really useful if you have a patching script that can apply the resulting changes. The following glue script will apply a set of diffs to a given object instance:

instance ?= null
nabu.frameworks.cdm.utils.diffResult [] diffs ?= null

# We first apply all the updates, if there are updates to array based items, the indexes should be correct until we actually start deleting and inserting
for (diff : diffs)
    # if it's an update, we straight up set it
    if (diff/action == "update")
        instance[/diff/path] = diff/newValue

# Then we perform all deletes
# The array that it impacts should always be at the end of the path
# However it is not guaranteed that they are in order, we might delete index 2, then 1, then 10. however, the deletes impact one another so we can't just execute them one after the other
# Per impacted array we collect a list of indexes that should be removed
# We then remove them all in one go
arraysToCheck = structure()
for (diff : diffs)    
    if (diff/action == "delete")
        arrayPath = replace("(.*)\[[0-9]+\]$", "$1", diff/path)
        arrayIndex = replace(".*\[([0-9]+)\]$", "$1", diff/path)
        arraysToCheck[/arrayPath] = when(arraysToCheck[/arrayPath] == null, series(arrayIndex), merge(arraysToCheck[/arrayPath], arrayIndex))

# We sort the arrays from longest name to smallest
# This is an easy workaround for "nested arrays" where deleting a parent might impact the child
# For a diff done between two documents this should never happen, for a multiway diff however it could be the case
for (arrayPath : sort(lambda(a, b, size(b) - size(a)), keys(arraysToCheck)))
    instance[/arrayPath] = filter(lambda(x, i, i !? arraysToCheck[/arrayPath]), instance[/arrayPath])
    
# We then need to add any inserts, the indexes are incorrect if any deletes have happened, so instead we just append at the end
for (diff : diffs)    
    if (diff/action == "insert")
        arrayPath = replace("(.*)\[[0-9]+\]$", "$1", diff/path)
        instance[/arrayPath] = when(instance[/arrayPath] == null, series(diff/newValue), merge(instance[/arrayPath], diff/newValue))

On the left is the patched company from our full example, on the right is the patched company from our partial example:

{
    "name": "Company2",
    "address": {
        "street": "testStreet2",
        "number": "35",
        "city": null,
        "postCode": "2000"
    },
    "contacts": [
        {
            "email": "user1@example.com",
            "firstName": "John",
            "lastName": "Smooth"
        },
        {
            "email": "user3@example.com",
            "firstName": "John",
            "lastName": "Not Smith"
        },
        {
            "email": "user2@example.com",
            "firstName": "John",
            "lastName": "Other Smith"
        }
    ]
}
{
    "name": "Company2",
    "address": {
        "street": "testStreet2",
        "number": "35",
        "city": null,
        "postCode": "2000"
    },
    "contacts": [
        {
            "email": "user4@example.com",
            "firstName": "John",
            "lastName": "Smith Again"
        },
        {
            "email": "user1@example.com",
            "firstName": "John",
            "lastName": "Smith"
        }
    ]
}