The shapes of data

It can be a bit of a pain to think about the various “shapes of data” available across languages. Often in learning a new language you might seek a “Rosetta Stone”. It’s valuable to be able to translate common shapes from one syntax to another. At the same time, it’s not always possible to make all of these translations cleanly. It can seem like some shapes are available in some places and not others.

Let me share my Rosetta Stone. It’s more fundamental, and thus widely applicable, than others. It’s also likely to be familiar to you because it’s exactly the same things you learned in high school algebra.

Note: This post is a reworking of an old post from an old blog.

What is data? What is a shape?

Data is the current state of the world. It’s static—immutable if you prefer—as even in a language which allows mutation at any single point in time the state of the world follows a particular shape.

We can separate “data” from the idea of “computation” which talks about change and effect. In most languages these lines are casually blurred. It may not always be completely clear where one ends and the other begins. I like to think that “data is anything you can hold in your hand”. It’s the things that exist right now. The state of the world may change, but at each snapshot the data is frozen as it is.

The shapes of data I want to share are compositional. We start from small pieces and then find ways to glue those pieces together to create more complex shapes. They then become a language for building up or taking apart the data you have.

The shapes of data

The shapes of data are 1, 0, X + Y, X \times Y, and X ^ Y.

These should look familiar. They’re some of the same components you learned as algebra. This is no mistake. The shapes of data are the same as the shapes of high school algebra. We’ll see later how to exploit this correspondence.

Actually, these shapes match intuitions we’ve built for many different kinds of things. Here’s a quick summary:

  • 1 is the trivial shape. It represents possibility and truth, it corresponds to void in C-like languages and is also called Unit or () in others, and, as you suspect, it works like the number 1 in mathematics.
  • 0 is the empty shape. It represents impossibility or falsehood, it corresponds to the result of calling a function that never returns and is sometimes (confusingly) called Void or ! or impossible. Finally, mathematically it corresponds to 0.
  • X \times Y is the “all together” shape. It represents combination and logical and and multiplication in mathematics. It’s wildly pervasive in programming languages as the notion of the struct, the object, the pair or tuple, or any time when you stick two pieces of data together at once.
  • X + Y is the “one or the other” shape. It represents choice or logical or, and in mathematics it corresponds to addition. It is also wildly pervasive in programming as the notion of choice that’s embedded in the booleans, in dispatch, in pattern matching, in if statements. That said, these sorts of things aren’t always something you can “hold in your hand”. Many languages make it difficult to talk about X + Y and instead only let you talk about its computational shadow. I’ll explain more below.
  • X^Y is the “mapping” shape. It represents logical implication and the notion of transformation from (confusingly) Y to X. So we often see it written as Y \rightarrow X or Y -> X and indeed it is represented by functions in programming languages.

Keeping this overview in mind, let’s dig in and explore each shape individually.

Unit, the most boring shape

The shape 1 is the shape of any single thing. By itself it conveys no data at all.

To a programmer it is most often known, confusingly, as void. A function in say C returns void when it has no interesting return value. But the thing to keep in mind is that a function which returns void actually does return. Compare that with a function like exit which immediately exits the program. That’s a different sort of behavior than returning boring value.

(Perceptive readers will note that stdlib.h gives the signature void exit(int status). But this is a lie. A function that never returns can claim whatever return value it wants: if exit claimed to return an int would you even know? We’ll see that the shape of this sort of data—nothingness—is 0 in a minute. C doesn’t have a type corresponding to 0, though, so it just picks an arbitrary stand-in.)

Unit is represented by 1 because if I give you a unit then you have exactly one, indivisible thing. It’s boring because if I give you two pieces of unit data then you can’t even distinguish them.

Unit is also important in another way. It lets us talk about “forgetting”. I’m always willing to give you a unit at any price. But it’s a bad trade; you lose everything you had. Unit is boring, free of all information.

Void, nothingness, the impossible shape

The shape 0 represents nothingness. I cannot give you a piece of data of shape 0. If you claim to have one then I know you are a liar. This shape of data has meaning through its absence, the void it leaves, and thus is perhaps better named void than 1 above.

In particular, void is the return type of functions which… do not actually return. They might loop forever, or exit the program, or always throw an exception.

This seems completely useless, but it conveys an important promise. There is no way to ever construct a piece of void data. Therefore, any way of producing it must be a lie.

For instance, imagine a shady merchant who promises to give you a piece of void if you give him, say, the Alchemist’s Stone. He cannot possibly have a piece of void. So, he must have great confidence that the Alchemist’s Stone does not exist either. Well, either that or he’s a liar himself and plans to skip town.

Or we can flip that situation around. I am completely happy to give you absolutely anything you desire for one piece of void. I feel confident in this because I am confident you can never pay that price.

More succinctly, a function from some input to void is a bet that such an input does not exist. They’re easy to build too: they just lie and do nothing. They’re safe in the understanding that they can never be called.

Connectives, ways of combining

Unit and void might feel mysterious when you learn about them on their own. They’re honestly too trivial to get a real handle on what they mean. To really understand, we have to compose them together into longer sentences, more sophisticated shapes of data.

Combination, multiplication, gluing together

The first connective for building shapes is X \times Y. To a programmer, this is the most natural tendency that underlies nearly all interesting data: it’s the notion of having an X and having a Y—at once!

In particular, structs, objects, hashes, lists, tuples, pairs, and even multiple returns are all instances of this phenomenon, this shape. It exists whenever two things are glued together. Though, it’s useful to look at the simplest example, the pair.

If I have an X and a Y I could give you either one individually, or package them together as (X, Y) and give you that bundle all at once. This is the honest-to-goodness core notion of X \times Y.

Products are completely pervasive and often the simpler they are the better. For instance, we can interpret the OO maxim of “composition over inheritance” as a desire to design objects as products of discrete pieces as opposed to a more sophisticated mix.

Products also capture the intuition of mathematical products. For instance, we can construct equations like 1 \times 1 = 1. This is mathematically true, but does it make sense in the world of data? How about 1 \times 0 = 0? Think about these for now and we’ll explore the idea more fully below.

Products also capture the notion of logical and. Since unit is truth, we can see 1 \times 1 = 1 as saying “true and true is equal to true”. Likewise, 1 \times 0 = 0 is the same as “true and false is equal to false”.

Choice, disjunction, and dispatch

The second major connective is X + Y. It represents the idea of “choice” or “either”. We can read, in terms of data, the shape X + Y as saying “either I have an X or I have a Y“. In particular, if I give you a piece of data of the shape X + Y I’ll give you either an X or a Y.

It’s critical to note that X + Y and Y + X aren’t the same shape. The notion of disjunction here is “tagged”. This is to say that when we come to understand whether we’ve gotten an X or a Y we also know which “side” it came from. If we didn’t know what “side” the value was on, then me giving you a value of type 1 + 1 would be me either giving you a 1 or a 1… which to your eyes would be the same as if I just had a 1 all along.

But 1 + 1 \neq 1. So we have to remember the “side”.

To see why this is important we can use what we’ve learned so far to construct our first interesting, named shape: the type named “boolean”

\mathbf{Boolean} = 1 + 1

Normally we think of booleans as being either true or false. If we didn’t distinguish the sides then we wouldn’t end up with a distinction between true and false because, as stated above, all units look identical.

So, choice is not merely “either what was in my left hand or my right hand” but also me opening that hand as I give it to you.

Visitors are a where you sometimes find choice

Unlike all of the other shapes, the shape of X + Y can be really subtle. Usually, choice is “lifted” from built-in types which represent choices. For instance, you can use a product to glue a boolean into a shape.

Other “built-in” choice shapes are integers, floating point numbers, and strings. In each of these cases, there is a list of possibilities and the data of that shape are the instances of that list.

In languages that don’t explicitly let you write your own choices, we can sometimes fall back to the visitor pattern. This might not be totally obvious, but a visitor is exactly what you might use to handle the situation where you’ve got, for instance, either a circle or a line or an arc. They represent a sum by noting that in order to use a sum we have to decide what to do for every possibility.

This is actually a fact we can read right out of a high school textbook, but first we have to see how to use the language of high school algebra to describe data.

High school algebra

With these four shapes we can start to write interesting things. For instance, we saw equations like 1 \times 1 = 1, 1 \times 0 = 0, and 1 + 1 = 2.

Okay, well, those are actually very silly and obvious statements when we just read them as math. The interesting part comes from reading them as statements about the shapes of data.

To make this correspondence, we have to figure out what “equality” means.

Two shapes of data are equal if they have the same “number of representatives”. To make this explicit, we can write |X| to be the number of representatives of a shape X. So, X = Y exactly when |X| = |Y|.

1 = 1 interpreted as data says “all units are equal”. 0 = 0 says that so are all voids. We note that 1 \neq 0 which says that unit is not the same as void. This is justified because while there’s one representative of the unit shape there are definitionally none for void.

We can read 1 \times 1 = 1 as noting that that there’s only one sort of pairing of two unit values. Similarly, 1 \times 1 \times 1 ... = 1 tells us that this property holds for any size of bundle.

On the other hand, 1 \times 0 = 0 is saying that if I want to pair together a unit and a void, well, I’m in trouble. There are no representatives of void whatsoever, so there is no way to build such a pairing.

Shapes with any number of representatives

Things get interesting when we include summation. 1 + 1 is the choice between two distinguished unit values and thus it has 2 representatives. We can even call that type 2 by conflating the number and the name of the type. It all works out because there’s such a tight correspondence between the two.

Or, to be totally explicit, we’ve been and can continue to use numbers both as actual numbers and as the name of a shape with that many inhabitants. So the shape 0 is the shape with zero inhabitants, the shape 1 has one. Boolean has two inhabitants and therefore is the shape 2. The shape 342 is some weird type with 342 representatives.

Now we get equations like 2 \times 2 = 4 which states that the pairing of two values of shape 2 is a shape with 4 inhabitants. Explicitly, we can see those data as

(true, true)
(true, false)
(false, true)
(false, false)

The final shape: exponentiation

The last shape I’ve not yet talked about is the shape of exponentiation. This is the strangest of the basic shapes. It represents functions, pure functions.

Specifically, the data of the shape X^Y are the ways of transforming Ys into Xes. (Note that this is reverse of what is easiest to read, X^Y is a function from Y to X not the other way around.) It creates the richest set of relationships between shapes we’ve seen so far and lets us use high school math to talk about all sorts of interesting things.

For instance, I said before that there are always functions from any shape X to 1, which we can write as the equation 1^X = 1. We can see that there are as many functions from 1 to a shape X as there are inhabitants of X by the equation X^1 = X.

We noted that the only way for there to be a function from some shape X to 0 is if there are no inhabitants of X to begin with. That’s just high school algebra as well: 0^X = 0 when X \neq 0 but 0^0 = 1 (at least sometimes).

We can show that X^0 = 1 which notes that there are trivial ways to turn void into anything else we want. We know we can hold that promise because we know that nobody can ever call our bluff.

Finally, let’s use high school algebra over exponentials to show that the visitor pattern is the same as a sum type. A visitor which visits three shapes, A, B, and C is a product of ways to use each of those three to produce some output shape R.

Visitor(A, B, C) = R^A \times R^B \times R^C = R^{A + B + C}

A visitor is just a function from a sum to some result.

The types of data

The shapes of data are 0, 1, X \times Y, X + Y, and X^Y. From these shapes we can build a rich universe of data shapes and even prove many useful relationships using just regular intuition for algebra.

In this essay I’ve consistently discussed the “shapes of data” as these shapes appear in all languages and contexts. They’re built from incredibly basic notions of existence, non-existence, togetherness, choice, and transformation.

That said, these shapes are also types. They’re usually called the algebraic data types after their strong relationship to standard algebra.

Not all type systems represent the algebraic data types directly. This is exactly what I was referring to when I mentioned the Visitor pattern. This is popular in OO languages where the type system prevents you from working with your own sum types. In languages with pattern matching, used-defined sum types are usually provided explicitly or are available through sealed hierarchies of types.

Richer shapes can also be constructed and are often available directly in languages. This is because the algebraic shapes can be too rudimentary or can imply bad memory performance. That said, the fundamentals expressed by the algebraic types are always there.

In that way, the algebraic shapes of data are the best Rosetta Stone for coming to grips with the shapes of data you see anywhere.