Magic Runes and Sand Dunes: The Binary Classification Theory of Knowledge
It won't replace klocs, but it's a start.
I. The Unreasonable Effectiveness of Computers
Why are computers useful? Possible answers might include:
they do math.
they follow instructions.
they store information.
they communicate.
they entertain.
etc.
And sure, these are all useful benefits of a general-purpose computer. But that's not quite what I'm interested in. Rather, I'm interested in the mechanism or principle. It seems to me like there ought to be some single, unifying reason why computers are useful. So, what is that single, unifying reason? What does it really mean to "compute"? What's the true nature of computation? What's its superpower?
It's a tough question. A modern computer is quite a complex object, between the MOSFETs, the logic gates, the bits and nibbles, etc. Layers upon layers of indirection.
For starters, two things that come to mind are logic and information. One could observe that the logic gates are used to manipulate information. Then perhaps "computation" consists of manipulating information via logic. But this kicks the can to two related questions of mine: what is "information"? (Also, why is it so closely associated with computers?) Claude Shannon famously captured its behavior in a formula, but even he did not know what to make of his own formula.
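For reference, the formula in question is Shannon's entropy: the average information content of a source, in bits. (This is the standard textbook form; nothing about it is original to this essay.)

```latex
H(X) = -\sum_{x} p(x) \log_2 p(x)
```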
II. Numerals grow on trees
One observation of mine is that software uses an awful lot of trees. The hardware, of course, prefers everything be organized in contiguous arrays. But in an ideal world where memory was infinite and instantaneous and came with no downsides, software would organize everything into trees. It's simply the most natural way of organizing information hierarchically. One surprising insight is that not only is the information often organized into trees, but even the information per se is made of trees! Let me explain.
If there's one thing I've learned about computers from Hollywood, it's that computers run on 1's and 0's. A lone 1 or 0 is called a "bit". Eight contiguous bits make a "byte". Bytes are used to represent both instructions and data. Each chip model has a list of instructions known as an "Instruction Set", which is a lexicon for "which sequences-of-bits map to which actions-taken-by-the-chip". So far in history, word sizes have usually come in 8, 16, 32, or 64 bits. The instructions are used to manipulate chunks of data in the computer's memory. And data (whether it encodes numbers, letters, images, etc.) is represented in bytes.
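To make "the same bits represent different things" concrete, here's a minimal sketch (Python standing in as a calculator; real chips do this in silicon, and the interpretations shown are illustrative):

```python
# The same eight bits mean different things depending on how they're read.
byte = 0b01000001                    # a bit pattern
assert byte == 65                    # read as an unsigned integer
assert chr(byte) == "A"              # read as an ASCII character
assert f"{byte:08b}" == "01000001"   # the raw bit pattern itself
```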
In a sense, a computer's activity can be said to entirely consist of "numbers manipulating numbers". It's all numbers. And pertinently, numbers grow on trees. Equivalently, numerals are trees. To explain this, let's discuss numeral systems.
Binary is a base-2 numeral system. This means there are two digits available for each digit-place: 0 and 1. Each digit-place carries a weight that multiplies an implicit power of 2. E.g. bin(10101) can be expanded to (1)(2^4) + (0)(2^3) + (1)(2^2) + (0)(2^1) + (1)(2^0). And the numeral represents a path through a tree, where each digit represents a node. Suppose I wanted to represent bin(0101). The path looks like this:
       0
     /   \
   0       1
  / \     / \
 0   1   0   1
/ \ / \ / \ / \
0 1 0 1 0 1 0 1
0 1 2 3 4 5 6 7
We start at the top of the upside-down tree, which is the root. Our first digit 0 is chosen for us. The next digit is 1, which tells us to take the right fork. The next digit is 0, which tells us to take the left fork. And the last digit is 1, which tells us to take the right fork. When we line up the leaf nodes and tag them with decimal numbers, we find that bin(0101) is equal to dec(5). Since computers are great at manipulating numerals, computers must also be great at navigating trees.
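For the procedurally inclined, here's the same walk in a few lines of Python (a sketch; the function name is mine, not standard terminology):

```python
def leaf_index(digits):
    """Walk a binary numeral, most-significant digit first, down the tree:
    doubling descends a level, and +1 takes the right fork instead of the left."""
    index = 0
    for d in digits:
        index = index * 2 + d
    return index

assert leaf_index([0, 1, 0, 1]) == 5   # bin(0101) lands on leaf 5, as in the diagram
```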
(N.B. base-10 numerals are also trees, but with 10 children per node instead of 2, which makes them harder to draw. Roman numerals can also be thought of as trees, but the number of children per node is not consistent across levels.)
III. Conflict-free Replicated Data Types (CRDTs)
I remember trying to wrap my head around CRDTs. I don't think it's necessary to go over them in detail. But the idea is that they're abstract datatypes that can be used over distributed computing networks without causing conflicts, because contradictory values are reconciled deterministically. I mention CRDTs because there are two types of CRDT: state-based and operation-based. And this got me thinking about how to understand the difference between them.
A state-based CRDT records a history of the states, and an operation-based CRDT records a history of the changes in states. "Change in state" immediately makes me think of differential calculus. And from there, it's easy to reason that a "change in state" represents information about the new state, minus information about the old state.
state_change = state_final - state_initial
I remember once wondering about the etymology of the word "abstraction". According to etymonline, "abstract" is derived from "ab" ("away from") and "trahere" ("to draw"). In this light, I tend to think of abstraction in terms of drawing away details from a concept until only the relevant details remain. From this, I realized that an operation-based CRDT can be thought of as an abstraction over the history of a state-based CRDT. If you only want to know the state at a particular point in time, you need the entire state. If you want to keep a record of the entire history of the state, you don't need to encode the entire state at each timestamp, because the component of the state that remains the same is repeated from t_n to t_(n+1).
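In symbols: the state history telescopes, so the latest state is always recoverable from the initial state plus the logged changes. (This is just the standard telescoping-sum identity, written in the essay's own state/time notation.)

```latex
s_{t_n} = s_{t_0} + \sum_{k=1}^{n} \left( s_{t_k} - s_{t_{k-1}} \right)
```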
For example: suppose I'm baking cookies, and I produce a dozen per hour. At t=0, I have zero cookies; at t=1, I have 12 cookies; at t=2, I have 24 cookies; etc.
State-based CRDT

t:     0;  1;  2;  3;  4; (...)
state: 0; 12; 24; 36; 48; (...)

Operation-based CRDT

t:  0;  1;  2;  3;  4; (...)
op: 0; 12; 12; 12; 12; (...)
The operation-based CRDT is able to record the transition from t=3 to t=4 as "12" because the 36 cookies I already had are implicitly encoded in the previous entries as (0 + 12 + 12 + 12). The information is allowed to be abstracted away since it is already contained somewhere else in the ADT.
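Here's a toy sketch of the bookkeeping (deliberately not a real CRDT; there are no replicas or merge function, just the two recording styles from the tables above):

```python
states = [0, 12, 24, 36, 48]   # state-based: the full value at each t
ops    = [0, 12, 12, 12, 12]   # operation-based: only the change at each t

def state_at(t):
    """Reconstruct the state at time t by replaying the operation log."""
    return sum(ops[: t + 1])

# Replaying the deltas recovers every state in the state-based record.
assert all(state_at(t) == states[t] for t in range(5))
```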
When you consider that a computer can have tons of information about the real world encoded as nothing more than 1's and 0's, it almost seems as though "abstraction" is at the very heart of information. How is it that the real world can be boiled down to just 1's and 0's? If abstraction is roughly analogous to subtraction, then what's the analog of addition and equality?
IV. The Binary Classification Theory of Knowledge
I'd like to propose something I like to call The Binary Classification Theory of Knowledge.
"Binary classification" usually means sorting objects into one of two bins. It's used in medical diagnosis, email filters, machine learning, etc. However, sometimes objects get sorted into the wrong bins. This forms a punnett square.
                          SIGNAL
                     +                -
CONDITION   +   true-positive    false-negative
            -   false-positive   true-negative
The signal represents what bin you sorted the object into, and the condition represents the underlying reality. For example, your email probably has a spam filter. The filter tries to distinguish real spam from regular emails, but sometimes it gets it wrong. If the filter sorts an email into the spam folder, the email is said to have a positive signal. If the filter sorts it into the normal inbox, the email is said to have a negative signal. Whether the human reader decides an email is truly spam is called the condition. If the reader decides it's spam, the condition is positive. If the reader decides it's a normal email, the condition is negative. False positives are also commonly known as "false alarms". There's no equivalent term for false negatives that I know of. But in my head-canon I like to refer to them as "submarines", since they're silent yet deadly, and tend to fly under the radar.
Binary Classification looks pretty simple at first glance. But trying to improve a filter is more complicated than it might seem, because there's a multitude of ways to compare and analyze various corners of the Punnett square [0]. Two important metrics are specificity and sensitivity. Sensitivity is the fraction of condition-positives that get correctly flagged (the true-positive rate); specificity is the fraction of condition-negatives that get correctly passed over (the true-negative rate). There's often a tradeoff between them. E.g. for a given spam-filter: you can choose to err on the side of being overly sensitive (which decreases false-negatives, but also unfortunately increases false-positives); or you can choose to err on the side of being overly specific (which decreases false-positives, but also unfortunately increases false-negatives). Another way of thinking about specificity vs sensitivity is to notice that a highly specific filter reliably excludes the condition-negatives (e.g. no false alarms), and that a highly sensitive filter reliably includes the condition-positives (e.g. no surprise-attacks by sneaky submarines).
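In code, the two metrics are one-liners. (The spam-filter counts below are hypothetical, chosen just to exercise the formulas.)

```python
def sensitivity(tp, fn):
    """True-positive rate: fraction of condition-positives correctly flagged."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """True-negative rate: fraction of condition-negatives correctly passed."""
    return tn / (tn + fp)

# Hypothetical filter: caught 90 of 100 spam emails (10 submarines slipped
# through) and wrongly flagged 5 of 200 legitimate ones.
assert sensitivity(tp=90, fn=10) == 0.9
assert specificity(tn=195, fp=5) == 0.975
```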
The reason I bring this up is because I think it serves as a remarkably useful model not only for analyzing boolean values, but for understanding the nature of knowledge itself. In this model, the usefulness of knowledge can be judged by two components: descriptiveness and informativeness. Descriptiveness is analogous to sensitivity because the concept is built on set-inclusion. Informativeness is analogous to specificity because the concept is built on set-exclusion.
For example, suppose police are interviewing the witnesses of a robbery. The descriptiveness of the testimony depends on how much data the witness is willing to share. If the witness gives a lot of details, the testimony is highly descriptive, because police are then able to imagine what the suspect looks like and become more alert (aka sensitive) when they meet a stranger whose appearance matches one of the attributes described in the testimony. This broadens the pool of suspects from zero to some. If the witness shares data about the robber which is highly unique, the testimony is highly specific, because police are then easily able to distinguish (aka specify) a potential suspect from someone innocent whose appearance immediately disqualifies them. This narrows the pool of suspects from all to some.
V. Specificity : Sensitivity :: Computers : Sensors
This model neatly answers our earlier question about the nature of computers. A computer's superpower is *specificity*. Computer Science researchers often talk about "complexity". Namely: time complexity; memory complexity; Kolmogorov complexity; etc. But complexity is merely the cost. The thing that the complexity is actually paying for is the *specificity*.
Numbers (being leaf nodes in a tree) and numerals (being paths through a tree) are useful to computers for manipulating specific objects, because trees compress a set of objects into an efficiently traversable hierarchy. Memory addresses are numbered. So when a computer navigates its hard drive, it's implicitly navigating a tree to a specific set of memory cells. And instructions are numbered. So when a software program issues an assembly-code instruction to the CPU, the CPU is implicitly navigating its ISA for a specific behavior. And when the ALU is performing arithmetic, it's implicitly navigating the space of integers for a specific solution to a math problem.
If computers embody extreme specificity, then what embodies extreme sensitivity? The answer, I believe, is sensors. Like I said earlier, a description starts with an empty set, then seeks to fill the set with objects whose properties match a given filter. A piece of information starts with the universe of discourse, then seeks to remove from the set the objects whose properties do not match a given filter. A hardware sensor "senses" some property of the world by checking for a correspondence between a description of the signal it expects vs the actual signal it receives. If the signal matches the description, the sensor shunts it to a computer to be stored and manipulated.
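A minimal sketch of that inclusion/exclusion duality in Python (the names and data are made up purely for illustration):

```python
universe = {"alice", "bob", "carol", "dave"}
matches_filter = {"alice", "carol"}   # a hypothetical property check

# Description, sensitivity-flavored: start empty, include what matches.
described = {x for x in universe if x in matches_filter}

# Information, specificity-flavored: start with everything, exclude non-matches.
informed = universe - {x for x in universe if x not in matches_filter}

# Both routes converge on the same set.
assert described == informed == {"alice", "carol"}
```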
This is surprising because initially, I figured that if computers' superpowers had some sort of "inverse" or "dual", it would be equally as complex. But it's straightforward. I'm still not quite sure why this is the case. For now, my working hypothesis is that subtracting from an arbitrarily large mass is innately more complex than adding to an empty set.
VI. Theories of Truth
One unintended ramification of the Binary Classification Theory of Knowledge is that it solves the question of the Nature of Truth.
In philosophy, there are two main schools of thought: the Correspondence Theory of Truth, and the Coherence Theory of Truth. According to the Correspondence Theory, a proposition is "true" if it corresponds to reality. According to the Coherence Theory, a proposition is "true" if it's logically consistent with the other propositions in your belief system.
I think the two schools are describing different aspects of knowledge. The Correspondence Theory is describing "truth", where truth is defined as a correspondence between your beliefs about the world and your sensory perception of the world. The Coherence Theory is describing "validity", which specifies valid states given what you know to be true about the world.
"Truth-seeking" then can be thought of as drawing correspondences between sensory perceptions about the world, and your mental model of the world. "coherence" then, can be thought of as agreement among beliefs as espoused by the mental model.
VII. The Gettier Problem
Another unintended ramification of the Binary Classification Theory of Knowledge is that thinking in terms of set-inclusion and set-exclusion gave me an insight regarding the Gettier Problem.
Plato defined knowledge as "justified true belief". The Gettier Problem asserts that this definition is unsatisfactory, since it's possible to think of examples where a belief seems true and justified, but is intuitively not considered knowledge. From Wikipedia:
Case I
Suppose that Smith and Jones have applied for a certain job. And suppose that Smith has strong evidence for the following conjunctive proposition: (d) Jones is the man who will get the job, and Jones has ten coins in his pocket.
Smith's evidence for (d) might be that the president of the company assured him that Jones would, in the end, be selected and that he, Smith, had counted the coins in Jones's pocket ten minutes ago. Proposition (d) entails: (e) The man who will get the job has ten coins in his pocket.
Let us suppose that Smith sees the entailment from (d) to (e), and accepts (e) on the grounds of (d), for which he has strong evidence. In this case, Smith is clearly justified in believing that (e) is true.
But imagine, further, that unknown to Smith, he himself, not Jones, will get the job. And, also, unknown to Smith, he himself has ten coins in his pocket. Proposition (e) is true, though proposition (d), from which Smith inferred (e), is false. In our example, then, all of the following are true: (i) (e) is true, (ii) Smith believes that (e) is true, and (iii) Smith is justified in believing that (e) is true. But it is equally clear that Smith does not know that (e) is true; for (e) is true in virtue of the number of coins in Smith's pocket, while Smith does not know how many coins are in his pocket, and bases his belief in (e) on a count of the coins in Jones's pocket, whom he falsely believes to be the man who will get the job.[1]
Case II
Smith, it is claimed by the hidden interlocutor, has a justified belief that "Jones owns a Ford". Smith therefore (justifiably) concludes (by the rule of disjunction introduction) that "Jones owns a Ford, or Brown is in Barcelona", even though Smith has no information whatsoever about the location of Brown. In fact, Jones does not own a Ford, but by sheer coincidence, Brown really is in Barcelona. Again, Smith had a belief that was true and justified, but not knowledge.
To unpack this, the additional ingredient we need is semiotics, the study of signs. The philosopher Charles Sanders Peirce once posited that "signs" are triads. Namely (borrowing terminology loosely; Peirce's own terms were representamen, interpretant, and object), a sign consists of a signifier, a signified, and a referent. A signifier is the word, diagram, icon, etc. which points to a concept. A signified is the concept in one's mind. And a referent is the physical object or phenomenon that the signified corresponds to. E.g. suppose I ask "where's the bagels?" The vibrations in the air are the signifier, the mental image is the signified, and the bagels that I'm actually looking for are the referent. Let's refine this model further.
In philosophy, truth-values are said to be a property of propositions. From the perspective of the Binary Classification Theory of Knowledge, truth represents a correspondence between a proposition and a state of the world. But now that we're considering semiotics, we have a new problem. The word "true" supposedly denotes a single relationship between two things. But since signs are triadic, there are three pairwise relationships which may or may not hold, rather than one.
For the purposes of this essay, I'm going to give each relationship a unique name. I'm not sure I'm satisfied with these; maybe I'll come up with better names in the future.
illustrative = correspondence between signifier and signified
faithful = correspondence between signified and referent
factual = correspondence between signifier and referent
E.g. a painting is illustrative of a topic if it delineates the artist's vision in high resolution, even if the subject matter is fictional. A mental map is faithful to the terrain if it accurately interprets reality with high fidelity, even if there are no signs present or the signs are misleading. And a proposition is factual if the referent can be interpreted as fitting the proposition, even if someone was misled to the wrong conclusion. It turns out that verisimilitude is not only continuous, but multilayered!
The confusion in the Gettier Problem stems from the fact that each case is factually true, but not faithful or illustrative. With this confusion dissolved, I think we can rescue Plato's JTB definition. And additionally, I think we can map "true" onto "correspondence", and map "justified" onto "coherent".
When I generated this theory, it seemed so simple that surely some dead European philosopher had beaten me to the punch. Or, since AI researchers, doctors, email coders, etc. already work with binary classification, surely they'd have noticed what was in front of their noses! But when I checked the Stanford Encyclopedia of Philosophy, all I could find was Robert Nozick discussing "sensitivity conditions", but not as part of some greater schema. I feel morally obliged to mention this somewhere, though I'm not sure how to fit it into the rest of the essay, since it didn't influence my own train of thought.
> This is surprising because initially, I figured that if computers' superpowers had some sort of "inverse" or "dual", it would be equally as complex. But it's straightforward. I'm still not quite sure why this is the case. For now, my working hypothesis is that subtracting from an arbitrarily large mass is innately more complex than adding to an empty set.
A sensor's symmetric share of the complexity lives mostly in the "outside world" portion of the system, but is also partly hidden in plain sight by components which happen to be easy to define in bulk. If you had to build a telescope's ten-meter-wide main mirror by directly adding or subtracting along a grid of perfectly cubic blocks, instead of polishing, how tiny would those cubes need to be? What vast complexity could any 3D printer capable of such a task instead encode on an object of equal size?
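For a back-of-envelope sense of scale (assuming a surface tolerance of roughly 30 nanometers, a small fraction of a visible-light wavelength; the numbers are illustrative, not a real optics spec):

```python
cube_edge = 30e-9        # metres; assumed tolerance (roughly lambda/20 for visible light)
mirror_diameter = 10.0   # metres

cubes_across = mirror_diameter / cube_edge
print(f"~{cubes_across:.0e} cubes just to span the diameter")   # ~3e+08
```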