Paradigms All The Way Down

Posted on January 10, 2019 by Scott Alexander

Every good conspiracy theorist needs their own Grand Unified Chart; I’m a particular fan of this one. So far, my own Grand Unified Chart looks like this:

Philosophy Of Science	Paradigm	Anomaly	Data
Bayesian Probability	Prior	KL-divergence	Evidence
Psychology	Prediction	Surprisal	Sense-Data
Discourse	Ideology	Cognitive Dissonance	Facts
Society	Frames & Narratives	Exclusion	Lived Experience
Neuroscience	NMDA	Dopamine	AMPA

All of these are examples of interpreting the world through a combination of pre-existing ideas about what the world should be like (first column), plus actually experiencing the world (last column). In all of them, the world is too confusing and permits too many different interpretations to understand directly. You wouldn’t even know where to start gathering more knowledge. So you take all of your pre-existing ideas (which you’ve gotten from somewhere) and interpret everything as behaving the way your pre-existing ideas tell you they will. Then as you gradually gather discrepancies between what you expected and what you get (middle column), you gradually become more and more confused until your existing categories buckle under the strain and you generate a new and self-consistent set of pre-existing ideas to see the world through, and then the process begins again.

All of these domains share an idea that the interaction between facts and theories is bidirectional. Your facts may eventually determine what theory you have. But your theory also determines what facts you see and notice. Nor do contradictory facts immediately change a theory. The process of theory change is complicated, fiercely resisted by hard-to-describe factors, and based on some sort of idea of global tension that can’t be directly reduced to any specific contradiction.

(I linked the Discourse and Society levels of the chart to this post where I jokingly sum up the process of convincing someone as “First they ignore you, then they laugh at you, then they fight you, then they fight you half-heartedly, then they’re neutral, then they grudgingly say you might have a point even though you’re annoying, then they say on balance you’re mostly right although you ignore some of the most important facets of the issue, then you win.” My point is that ideological change – most dramatically religious conversion, but also Republicans becoming Democrats and vice versa – doesn’t look like you “debunking” one of their facts and them admitting you are right. It is less like Popperian falsification and more like a Kuhnian paradigm shift or a Yudkowskian crisis of faith.)

Why do all of these areas share this same structure? I think because it’s built into basic algorithms that the brain uses for almost everything (see the Psychology and Neuroscience links above). And that in turn is because it’s just factually the most effective way to do epistemology, a little like asking “why does so much cryptography use prime numbers”.

58 Responses to Paradigms All The Way Down

Hoopyfreud says:

January 10, 2019 at 6:11 pm

I think that black swans are Bayesian anomalies. If an unforeseen event highlights a new causal pathway you need to reformulate your prior, not just update it.

Similarly, I think that catastrophe is a societal anomaly. This may be cheating, since catastrophe is by definition an anomaly, but societal frames tend to shatter when really bad shit happens.
- James B says:
  
  January 10, 2019 at 9:36 pm
  
  I think pure Bayesian analysis doesn’t fit this schema. It’s predicated on already knowing what your space of possible models is.
  - Hoopyfreud says:
    
    January 10, 2019 at 9:43 pm
    
    Which we don’t (see: Kuhn), so I usually interpret “Bayesian paradigm” as “Bayesian-proximate practical epistemology.” Pure Bayesian analysis is, of course, as perfect as it is impossible.
  - Nootropic cormorant says:
    
    January 11, 2019 at 12:39 pm
    
    I would say that the space is the space of all probability distributions over your sensors (data receiving nodes).
    - Reasoner says:
      
      January 11, 2019 at 8:33 pm
      
      Trouble is that’s an intractably large number of hypotheses. (And the hypotheses themselves are also intractably large.) So in practice you need a way to compress & prune the hypotheses under consideration. When you’re too aggressive in your pruning, you get the sort of cognitive blind spots that the middle column alerts you to.
NickH says:

January 10, 2019 at 6:31 pm

“Then you win” tends to play out more like “then they say that what you’re saying is obvious”.
- ayegill says:
  
  January 14, 2019 at 9:56 am
  
  If the purpose of the argument was to convince the opponent that what you’re saying is true, that means you’ve won.
ManyCookies says:

January 10, 2019 at 6:32 pm

Every good conspiracy theorist needs their own Grand Unified Chart; I’m a particular fan of this one.

What even the fuck am I looking at? I understand the nodes in the center graph are showing the cult’s timeline, and the edges are the relations/progressions between nodes. But what’s in the whitespace in the graph, and what on earth is going on around the graph?
- Bugmaster says:
  
  January 10, 2019 at 10:15 pm
  
  I suspect that, if you need to ask, then you don’t deserve to know.
  
  🙂
- Trevor Adcock says:
  
  January 10, 2019 at 11:43 pm
  
  It actually sort of makes sense in a way. It’s based on a lot of dubious inferences and even one bad inference topples the whole thing, but it’s clear what it’s trying to say. At least if you know a lot about ancient history and absurd Baal worship conspiracies that some Christians are obsessed with.
  - Bugmaster says:
    
    January 11, 2019 at 12:21 am
    
    Personally, I see this diagram as one huge instance of cultural appropriation, seeing as they obviously copy-pasted the Sefirot, and then just wrote some random conspiracies inside the nodes. Get it together, people, you’ll never summon up any angels that way!
- Watchman says:
  
  January 11, 2019 at 1:25 am
  
  I’m just interested to discover the Jesuit involvement in the Glorious Revolution along with the Freemasons myself. The creator(/random collector of historical words and phrases) clearly knows much that I do not.
  - Deiseach says:
    
    January 11, 2019 at 3:41 am
    
    I’m just interested to discover the Jesuit involvement in the Glorious Revolution along with the Freemasons myself.
    
    I started to cobble something together for that by going “Well, I suppose if you take…” but luckily I stopped myself in time because that way lies underpants on head territory 🙂
  - Aapje says:
    
    January 11, 2019 at 6:59 am
    
    @Watchman
    
    Indeed, does this mean that William III was secretly a Freemason, who while ostensibly conquering Britain to prevent Catholic rule in favor of Protestantism, actually secretly spread freemasonry?
    - Watchman says:
      
      January 11, 2019 at 9:48 am
      
      I’m more inclined to see the evidence (which I will obviously make up accordingly) as showing the Glorious Revolution was an alliance between the two groups, whose ultimate aims (probably something to do with reestablishing a temple or the like) were foiled by an alliance of Swiss Merovingians and Bavarian Illuminati.
      
      If nothing else, this might finally explain why a British army was where it was to fight the Battle of Blenheim 18 years later (and note that 18 is when you come of age, albeit not in any relevant eighteenth-century culture…).
      
      Hey, this is much easier than all that research and stuff!
- Nick says:
  
  January 11, 2019 at 5:37 am
  
  what on earth is going on around the graph?
  
  Tag yourself, I’m VATICAN PAGANISM REBRANDED.
- moonfirestorm says:
  
  January 11, 2019 at 6:00 am
  
  I recognized the sephirot in the diagram, but when I compared it to Wikipedia’s diagrams of how they connect, I noticed some extra lines: namely, connections from Hod to Malkuth and from Yesod to Malkuth.
  
  This means that you can get to Malkuth, “kingship”, by going through Hod, “splendor/glory”, instead of Yesod, “foundation”.
  
  Fitting for a person with unfounded but exciting theories about the rulers of Earth.
  - Tatterdemalion says:
    
    January 12, 2019 at 12:39 pm
    
    It’s not just extra lines – they’ve added three extra sephiroth; the bottom Templar-Khazar-Jesuits-Masons-Switzerland “box” should just be a single pair of nodes.
- sclmlw says:
  
  January 11, 2019 at 8:11 am
  
  This chart is how you nerd snipe history buffs. 🙂
- Randy M says:
  
  January 11, 2019 at 9:33 am
  
  Can anything with the phrase “Islamic clown fetish masonry” not be a parody or algorithmically generated?
  - Watchman says:
    
    January 11, 2019 at 9:38 am
    
    It explains minarets well. And probably justified the Swiss opposition to such things…
  - Lambert says:
    
    January 11, 2019 at 10:00 am
    
    Not sure about clowns, but the outside of the Brighton Pavilion is a pretty clear fetishisation of Islamic masonry. (The interior being more a fetishisation of Chinese aesthetics.)
Big Jay says:

January 10, 2019 at 6:36 pm

I think the middle column for society should say something like “argument/election/revolution”.
Freddie deBoer says:

January 10, 2019 at 8:40 pm

How could you leave off Krishna Consciousness
phi says:

January 10, 2019 at 8:53 pm

You can put “surprisal” in the middle column for Bayesian probability theory.

https://en.wikipedia.org/wiki/Information_content
- temujin9 says:
  
  January 11, 2019 at 10:49 am
  
  Surprisal is a Bayesian term first, deriving from information theory. The psychological term would more rightly be just “surprise”.
- Reasoner says:
  
  January 11, 2019 at 8:17 pm
  
  I don’t think that actually fits in the middle column of the chart though.
  
  This post has some thoughts relevant to the question of why there’s no entry in the middle column for the Bayesian row. (Read from “Get to the goddamn point already. What’s wrong with Bayesianism?”) The way nostalgebraist puts it, Bayesianism “takes the hypothesis space for granted”, and the terms in the middle column suggest data points that make you think the correct hypothesis might be outside your hypothesis space.
shaedys says:

January 11, 2019 at 12:16 am

I’m going to make a point that’s irrelevant to the logic of the post, but:

Most symmetric cryptography schemes are not based on prime numbers, and the post-quantum algorithms being developed also do not depend on the difficulty of factoring large numbers into primes. It’s mainly the public-key cryptography you’ll see used for connection setup on the internet that does.
- albatross11 says:
  
  January 13, 2019 at 9:52 am
  
  Yeah, the difficulty of factoring large integers and finding discrete logs mod large primes makes for some pretty nice public-key crypto algorithms, but neither of these will be difficult for a sufficiently large quantum computer.
  
  One weird aspect of postquantum public key algorithms is that a lot of the basic ideas they’re based on are quite old. Coding-based schemes date back to the 70s, lattice-based crypto goes back to NTRU in the 90s, and really is kind-of a cousin of knapsack-based schemes people were working on in the late 70s. Multivariate signatures go back to the 80s. Hash-based signatures were invented in the 70s.
  
  But the factoring/discrete log based stuff was more efficient and generally nicer (smaller ciphertexts/keys), so most people worked in that area, and ignored the other stuff as kind-of interesting but either not very efficient (coding, hash-based sigs) or not very well-understood (NTRU, multivariate signatures).
Watchman says:

January 11, 2019 at 1:33 am

So to simplify a bit, and cut out some important nuancing (sorry), the world perceived is too complex without filtering, and therefore we impose schema (or models – I tend to use the two unforgivably as synonyms) upon it to allow comprehension? This is pretty much the realisation that underlies postmodernism, albeit that was originally focused on the understanding of texts and symbols, not the interpretation of everything we perceive. I hadn’t expected to wake up this morning to find Scott vaguely indicating the key component in my mental toolbox actually underlies many of the systems used to examine the world.
Rick Hull says:

January 11, 2019 at 3:02 am

My apologies, but I feel like that chart would be great fodder for the Galaxy Brain meme, but probably right to left.
- Nick says:
  
  January 11, 2019 at 5:17 am
  
  Or Weird Sun Twitter names. I like Anomaly of Paradigm.
  - Aron Wall says:
    
    January 11, 2019 at 11:56 am
    
    Feature of Gestalt ain’t bad either.
    
    Edit: Looked at canonical list again and made “feature” singular.
Gabriel says:

January 11, 2019 at 3:28 am

Machine learning also fits quite well. Take, for example, an artificial neural network trained with supervised learning. The pre-existing ideas are the connection weights at the beginning (randomized when you create the network); you then get a new stimulus/lesson (experience the world) and update your weights to fit the new information better (new theory).
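
A minimal sketch of that loop, using a single weight instead of a full network. The data (lessons drawn from y = 3x), the learning rate, and the number of passes are all made up for illustration:

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# The "pre-existing idea": one randomly initialized weight for the model y = w * x.
w = random.uniform(-1.0, 1.0)

# Hypothetical lessons: inputs paired with the answers the world gives (y = 3x).
data = [(x, 3.0 * x) for x in (1.0, 2.0, 3.0)]
lr = 0.05  # assumed learning rate

for epoch in range(50):
    for x, target in data:
        error = w * x - target  # mismatch between prediction and lesson
        w -= lr * error * x     # gradient step on 0.5 * error**2

print(w)  # ends up very close to 3.0
```

Each step only nudges the existing weight, so the starting point shapes the path even though it does not shape the final answer in a problem this simple.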
- whereamigoing says:
  
  January 11, 2019 at 9:02 am
  
  Hmm, that seems a bit oversimplified. With supervised learning, the current weights don’t (typically) affect what data is seen next, so there isn’t bidirectional interaction like in the other examples. And changing weights slightly usually looks more like learning within a theory than switching to an entirely new theory.
  
  Maybe there’s some analogy with reinforcement learning though.
sclmlw says:

January 11, 2019 at 6:13 am

I think this paradigm of paradigms needs to be able to address initial conditions. This is especially important, since it essentially says, “you can’t interpret facts without a bias”, but then where do the biases come from?

For most scientific fields, I can see these arising out of previous fields. Philosophy morphs into natural philosophy with Aristotle, then on to science, biology, molecular biology, etc.

Where does philosophy come from? Let’s say it eventually systematizes out of discourse. And discourse is derived from the biases of the individuals.

So now where do individual biases come from? In other words, where does an individual derive their initial biases? I’m inclined to lean toward an early childhood bias toward parents in general and any individual child’s mother in particular. This seems like a particularly adaptive evolutionary strategy – look to parents to derive biases about the wider world. However, I wonder to what extent abandoning some of those biases at a later age – first in mid-childhood, then again in teenage years – might stimulate the next generation to generate new adaptations toward changing external conditions.

Thus, you inherit your biases, but you’re nudged out of them through a natural process of recycling and renewal.
Lambert says:

January 11, 2019 at 7:09 am

What’s the alternative way of doing epistemology?
You have to take in information from the outside world somehow.
But if you had to rebuild your entire worldview from first principles every time you learned something, you’d spend most of your life proving 1+1=2.
So incrementally changing your understanding of reality is kind of the only way to do things.

I suppose it’s most obvious with Bayesian updates, where going back and recomputing everything is a thing you can do.
Eponymous says:

January 11, 2019 at 7:39 am

If I compare this model to an “ideal” Bayesian reasoner, the obvious difference is that one shouldn’t have just one theory of the world, but a whole set of theories. Indeed, in the true ideal version, one starts out with a reverse complexity prior over every possible theory.

While the full version is impossible to implement (even for an AI, let alone a human), I think this gives us some guidance: we should generally put less emphasis on having a single coherent theory of the world, but should instead have in mind a number of different theories. Then we should switch between them and compare how the world looks through each one, and update our beliefs across all theories accordingly.

I suspect that one major defect in human reasoning, compared to a decent reasoner, is our strong need for closure and intolerance for cognitive dissonance. The result is that the epistemic landscape consists of a number of self-consistent “worldviews” that are strong local attractors. This produces the embarrassing path-dependence we observe in people’s beliefs, their high level of “stickiness”, and the existence of persistent disagreements.

(This doesn’t mean that a decent Bayesian reasoner would end up uncertain among the standard disagreements among humans. In fact, my guess is that nearly every contentious topic among humans has an “obvious” correct answer under correct reasoning. That doesn’t mean that I know what it is, of course.)
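
A toy sketch of holding several theories at once and updating across all of them. This is nowhere near the ideal version: the three theories, the uniform prior (standing in for a complexity prior), and the observations are assumptions chosen only to keep the example small.

```python
def update(beliefs, likelihoods):
    """One Bayes step: weight each theory by how well it predicted the datum."""
    unnormalized = {h: beliefs[h] * likelihoods[h] for h in beliefs}
    total = sum(unnormalized.values())
    return {h: v / total for h, v in unnormalized.items()}

# Hypothetical theories: each is a claimed probability that a coin lands heads.
theories = {"tails-biased": 0.2, "fair": 0.5, "heads-biased": 0.8}
beliefs = {name: 1 / 3 for name in theories}  # assumed uniform prior

for flip in "HHTHHHTH":  # made-up observations: six heads, two tails
    likelihoods = {
        name: (p if flip == "H" else 1 - p) for name, p in theories.items()
    }
    beliefs = update(beliefs, likelihoods)

print(beliefs)
```

No theory is ever thrown out here; each just gains or loses weight. What the sketch cannot do is invent a fourth theory that was not in the dictionary to begin with.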
- ADifferentAnonymous says:
  
  January 11, 2019 at 8:08 am
  
  +1 to trying to hold multiple theories. I myself try to do this.
- Watchman says:
  
  January 11, 2019 at 9:35 am
  
  How many people truly hold to a single theory of the world? You’d have to do this knowingly, as otherwise you’ll pick up bits of multiple theories as you go through life and, being human, will probably use them without having to fully reconcile the differences. The only examples I could think of are fairly extreme religious belief systems (monasticism perhaps) and maybe the very comparable political zealots.
  
  That said, the normal human viewing the world through a variety of filters is not going to be consciously switching between them. Trying to see the world in different ways is probably about as rare as trying to see it in only one correct way.
  - Eponymous says:
    
    January 11, 2019 at 12:06 pm
    
    I agree that people hold inconsistent views. But I think they mostly use one framework to think about a particular subject. That seems to be what Scott is saying in the OP at least.
    - Watchman says:
      
      January 12, 2019 at 4:31 am
      
      I don’t claim people normally hold multiple frameworks, just that most people’s frameworks will be composed of pieces of different and potentially contradictory frameworks: an individual, and not entirely logical, framework for each person.
- HeelBearCub says:
  
  January 11, 2019 at 11:36 am
  
  Then we should switch between them and compare how the world looks through each one, and update our beliefs across all theories accordingly.
  
  This to me looks like a “duh! that’s what people do.” statement. Although I would modify the “updating” part a little to note that most of the updating occurs in terms of when to use which model. Attempting to reconcile across all the models isn’t necessarily possible, so you don’t try to update them all.
  
  It’s those who expect humans to be consistent to a single model that look irrational to me.
Loris says:

January 11, 2019 at 12:33 pm

I was wondering if it was possible to come up with atypical rows for that Grand Unified Chart.
I quite like this one:

1984 : History, Doublethink, History
Snarwin says:

January 11, 2019 at 12:35 pm

It looks like the row labels for “Psychology” and “Perception” have been swapped by accident.
G Gordon Worley III says:

January 11, 2019 at 1:50 pm

I’ve done a bit of thinking in this direction also and wrote about it a while back in terms of a personal growth cycle. It seems to be just something fundamental to how stuff interacts in ways that result in useful complexity.
- Reasoner says:
  
  January 11, 2019 at 8:24 pm
  
  All the rows in the chart are descriptive (of human reasoning) rather than prescriptive, except the Bayesian Probability row, which is prescriptive. And the prescriptive row is itself a little anomalous because there’s no entry in the middle column (I left a comment lower in the thread talking about why this is). I think it’s possible there are other prescriptive methods that don’t quite fit the pattern, which might outperform human reasoning. For example, if you’re an AI and you have an unlimited supply of electricity, you might choose to recompute your models on a continuous basis instead of waiting until anomalous-looking data points arrive.
owengray says:

January 11, 2019 at 2:19 pm

I think there are two additional columns (for essentially failure modes) you could add to your table that would let you draw more analogies, make your claim broader, and make more obvious some of the hypotheses.

PhiloSci __ Crackpot? ____________ Denier?
Bayesian_ Require too little evidence_ Demand too much
_______Underconfidence _______ Overconfidence
Psych ____ Schizophrenia _______ Autism
Percept __ Hallucinations _______ ??? (Masks?)
Discourse_Conflict theory? _____ Mistake Theory?
Society __ Young generation? ___ Old generation?
Neurosci ?Hallucinogens? SSRIs? Amphetamines?

I’m pretty sure there are much better words for many of those cells (and I suspect that some of the ones I have are quite wrong), but you seem to be seeking correlations within each of these failure mode columns, in some of your hypotheses/predictions for the survey.

It’s perhaps unlikely that one neurocognitive misfunction could push someone into one failure mode for all of these categories, as some of the categories likely have larger driving factors than just individual neurochemistry, like social groups. But I’d expect significant intra-column correlation, which would be one of the strongest and most-easily-testable and most-elegant-looking predictions of your theory.

EDIT: I hate this formatting

(Some of these seem really wrong to me, on second look. A crackpot first requires too little evidence to form a theory, then too much evidence to change it. You could argue that’s a transition from crackpot -> denier, but I suspect they’re just very ill-chosen words for that cell.)

I feel like cults should fit into this framework somehow, and you’ve written about them before.
Perhaps related to something about religion being about finding meaning in everything. That might be bleeding into the same mismatching issues with crackpot, though.
eigenmoon says:

January 12, 2019 at 12:51 am

It is typical for a deep neural network to become stuck for a while and then continue improving. In his “Efficient BackProp”, Yann LeCun shows that this happens when the Hessian (the quadratic approximation of the local landscape) has several huge eigenvalues he nicknamed “big killers”. (See Fig. 20 on p. 34 for the distribution of eigenvalues in an actual network.)

Having a huge eigenvalue means that the local landscape looks like a narrow ravine. Imagine that you live on the landscape of the function x² + 1000000 y² at the point (1, 0.01) and you want to find the lowest point. You don’t know where the origin or the axes are. The movements along the y axis seem to be the most important, and that’s where all the gradient descent action is going to be. But in the long run the actually important movements are little steps along the x axis that are unnoticeable at first.
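
A small sketch of the ravine, using plain gradient descent on that same function from that same starting point. The learning rate and step count are assumptions; the rate has to stay below 1e-6 here or the steps along y blow up.

```python
def f(x, y):
    return x**2 + 1_000_000 * y**2

def grad(x, y):
    """Gradient of f: the y component is a million times steeper per unit."""
    return 2 * x, 2_000_000 * y

def descend(x, y, lr, steps):
    """Plain gradient descent with a fixed learning rate."""
    for _ in range(steps):
        gx, gy = grad(x, y)
        x, y = x - lr * gx, y - lr * gy
    return x, y

x0, y0 = 1.0, 0.01
lr = 4e-7  # assumed; capped by the steep y direction
x1, y1 = descend(x0, y0, lr, steps=100)

print(f(x0, y0))          # about 101
print(x1, y1, f(x1, y1))  # y has collapsed to ~0, x has barely moved
```

After 100 steps y is essentially zero while x has covered less than a hundredth of a percent of the way to the origin, because the step size that keeps y stable is far too small for x.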
- Eponymous says:
  
  January 12, 2019 at 8:11 am
  
  Imagine that you live on the landscape of the function x² + 1000000 y² at the point (1, 0.01) and you want to find the lowest point. You don’t know where the origin or the axes are. The movements along the y axis seem to be the most important, and that’s where all the gradient descent action is going to be. But in the long run the actually important movements are little steps along the x axis that are unnoticeable at first.
  
  I don’t understand your example. Moving x from 1 to 0 reduces the value of the function by 1. Moving y from 0.01 to 0 reduces the value of the function by 100. So the small movements in the y direction end up being 100x more important.
  - eigenmoon says:
    
    January 12, 2019 at 10:29 am
    
    I meant “important” in terms of distance to the actual minimum.
    
    In a more realistic scenario, the function you have is complicated (definitely not quadratic) but if you measure all the second derivatives at the point where you live, you can get a quadratic approximation. So if you are currently at the point (1, 0.01) and x² + 1000000 y² is merely a quadratic approximation at your current location, you can figure out that you need to move in the general direction of x = y = 0. In this sense, moving along x is more important. When you get there though, you might find something completely different from a local minimum since your function is actually not quadratic. Maybe you’ll find a big slide downward. If that’s the case, finding the slide was a lot more important than moving along y – this time “important” in the sense of the function value as well.
stoneinthewaves says:

January 12, 2019 at 9:10 am

The pattern you’re pointing to is also the central point in hermeneutics, the hermeneutic circle. The basic idea of the circle is that to understand any text as a whole, you have to understand all its parts, while to fully understand any part you have to have a complete knowledge of the whole context. For example, in interpreting the Bible you clearly have to read John 1:1, but to understand John 1:1 it helps to have read the rest of the Bible so you have a good sense of who this “God” character is. This is true at every scale — to understand the single sentence in John 1:1, you have to recognize all the words, but clearly to get the meaning of each word you need to know what it’s doing in the context of the sentence, while for interpreting the Bible you should also know as much as possible about the historic context in which it was written, the relevant languages and translation history, etc.

The thing that’s great about hermeneutics is that it takes something that seems like an insuperable problem — how can we ever understand anything perfectly if understanding is based on this endless back-and-forth of contextualization? — and basically solves it by saying that, well, we can’t, but we can certainly do better. There’s a faith in hermeneutics that something like a True Meaning Of The Text exists, and that it can be approached asymptotically, much like the predictive processing model allows for a steady pragmatic adjustment towards an objective reality that we can’t confirm is fully represented in any model. It’s when our background model runs into interpretive trouble that we consciously activate the hermeneutic circle, and either digest the troublesome information into our model, or reshape our idea of the text/world as a whole.

As far as I remember from reading Gadamer (maybe the biggest name in modern hermeneutics), the basic question of how we iteratively improve our understanding based on the perception of misfit between model and evidence is one that applies to all fields, although it’s most apparent when working with language. You make certain hypotheses that seem likely to be true, then check them against a portion of the available evidence, then rinse and repeat ad infinitum. The crucial thing is that for you to advance most rapidly, you need to preselect among the possible hypotheses and pieces of evidence, in a process that’s somewhat mysterious. As far as I can tell no one believes you can find a fully “right” method to deal with this; it comes down to an odd mix of explicitly defended priors and gut feelings that you might call “judgement.” That’s disturbing if you want easily defensible justifications for every epistemological move. But the idea that everything (including learning how to walk) works roughly this way makes it easier to swallow that at least sometimes it’s indispensable.

One more thing: Nick Szabo has a nice application of hermeneutic theory to biological/cultural evolution. Selective experimentation followed by broad adoption of successful variants appears to be a strategy so good it recurs in almost all processes that can be classified as developmental, including when no one mind is doing the selecting. The obvious element of randomness involved in something like evolution also might help explain why judgement is acceptable: if we didn’t try out hypotheses that aren’t fully justifiable a priori, it’s very hard to see how we would ever improve our underlying beliefs.
thomasbrinsmead says:

January 12, 2019 at 11:45 am

The missing column for Society might be social/political crisis/revolution. Examples include the French Revolution, the end of legalised slavery, the Beat Generation of the sixties, the replacement of monarchy by democracy, and feminism.

Evolutionary biology might have: continuous evolution of breeds in a stable ecological environment, pre-reproductive mortality and population collapse, discontinuous evolution and speciation.

The process dynamics archetype implicitly referenced by this framework applies at many scales, from the small, as when a linguist learns a new alphabet, to mid-scale for a new language, to large-scale for a new cultural worldview, and everything in between.
- Matthias says:
  
  January 14, 2019 at 5:32 am
  
  Not sure Abolition of Slavery was such a big paradigm shift? It was mostly a case of recognising the hypocrisy of allowing abroad what is forbidden at home.
davidbahry says:

January 13, 2019 at 4:22 pm

Incidentally, surprisal and KL-divergence are related but non-identical. (following https://math.stackexchange.com/a/374981/470287 for the explanation of surprisal and entropy and https://www.countbayesie.com/blog/2017/5/9/kullback-leibler-divergence-explained for KL divergence)

– Surprisal is just a transformation of probability. Like probability, it measures uncertainty about individual outcomes. Its definition fits two intuitive constraints: 1) the less likely something is, the higher its surprisingness if it does happen; 2) if two things are independent, then the surprisingness if they both happen should equal the sum of their individual surprisingnesses. So, fitting those constraints, if outcome A has probability p(A), then the surprisal of outcome A is just

surprisal(A) = -log[p(A)]

– The long-run average surprisingness of samples drawn from a probability distribution – the probability-weighted average of the distribution’s outcomes’ surprisals – is that distribution’s entropy:

Entropy(distribution) = summation over i of: -log[p(i)]*p(i)

– KL divergence’s formula looks similar to the entropy formula, but is for describing the dissimilarity between two different distributions. It can be intuitively thought of as describing the long-run average difference in surprisal, as outcomes are observed, between, let’s say, two people who have two different subjective probability distributions for the set of outcomes, under the assumption that one of the people is “right”, in that the objective frequency distribution for the outcomes matches their subjective probability distribution. (It is asymmetric – divergence(BOB|ALICE) =/= divergence(ALICE|BOB) – because it depends on which one we assume to be right.) Where p(i) and q(i) are disagreeing subjective probabilities for the same outcome, but the frequency-weighting follows p(i), the expected difference in surprisal is

KLdivergence(p||q) = summation over i of: {log[p(i)]-log[q(i)]}*p(i)
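
The three formulas above can be sketched in a few lines. The choice of base-2 logarithms (so everything comes out in bits) and the two coin distributions for Alice and Bob are assumptions for illustration; the formulas themselves do not fix a base.

```python
import math

def surprisal(p):
    """Surprisal of an outcome with probability p, in bits."""
    return -math.log2(p)

def entropy(dist):
    """Probability-weighted average surprisal of a distribution."""
    return sum(p * surprisal(p) for p in dist if p > 0)

def kl_divergence(p, q):
    """Expected extra surprisal of the q-believer when outcomes follow p.

    Assumes q gives nonzero probability to everything p allows.
    """
    return sum(
        pi * (math.log2(pi) - math.log2(qi)) for pi, qi in zip(p, q) if pi > 0
    )

alice = [0.5, 0.5]  # hypothetical: Alice thinks the coin is fair
bob = [0.9, 0.1]    # hypothetical: Bob thinks it is heavily biased

print(surprisal(0.5))             # 1.0
print(entropy(alice))             # 1.0
print(kl_divergence(alice, bob))  # Bob's extra surprise if Alice is right
print(kl_divergence(bob, alice))  # Alice's extra surprise if Bob is right
```

The last two numbers differ, which is the asymmetry: being wrong about a fair coin costs Bob more than being wrong about a biased coin costs Alice.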
Eli says:

January 13, 2019 at 5:33 pm

Sounds like you’re noticing that most inferential processes involve overhypotheses, and that these overhypotheses themselves can only be changed by building up large bodies of conjunctive data (this observed fact and that observed fact, across different modes of observation). I wrote a paper about this back in ~2016 (though it got rejected).
crucialrhyme says:

January 14, 2019 at 12:56 pm

I just feel the need to quixotically state that there’s nothing particularly special about KL-divergence; there are other reasonable ways of measuring distance between probability distributions. It just got picked because it’s simple and computationally convenient.

In particular, the Kantorovich-Rubinstein optimal transport metric (aka Wasserstein distance) is arguably much better and more elegant whenever you’re looking at probability distributions over a space that already has some underlying notion of distance, which you almost always are (real numbers, or Euclidean vectors). Thanks to Marco Cuturi [1], if you’re willing to tolerate a tiny bit of regularization, there is now a very fast algorithm for numerically solving optimal transport problems.

This is one of the things that bugs me about Friston’s “free energy” thing: the whole idea relies on the assumption that when you do variational inference, you inherently want to minimize KL-divergence. But to me using KL-divergence seems like just a computational convenience, not anything fundamental.
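
A rough illustration of the difference, under heavy simplifying assumptions: distributions live on the integer grid 0, 1, 2, ..., and the transport distance is computed with the one-dimensional closed form (the summed gap between the two cumulative distributions), not with the Sinkhorn algorithm from [1].

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats; infinite if q rules out something p allows."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf
            total += pi * math.log(pi / qi)
    return total

def wasserstein_1d(p, q):
    """W1 distance between distributions on the grid 0, 1, 2, ... (unit spacing)."""
    distance, cdf_gap = 0.0, 0.0
    for pi, qi in zip(p, q):
        cdf_gap += pi - qi        # running difference of the two CDFs
        distance += abs(cdf_gap)  # mass that still has to cross this grid cell
    return distance

spike_at_0 = [1.0, 0.0, 0.0, 0.0, 0.0]
spike_at_1 = [0.0, 1.0, 0.0, 0.0, 0.0]
spike_at_4 = [0.0, 0.0, 0.0, 0.0, 1.0]

print(kl_divergence(spike_at_0, spike_at_1), kl_divergence(spike_at_0, spike_at_4))
print(wasserstein_1d(spike_at_0, spike_at_1), wasserstein_1d(spike_at_0, spike_at_4))
```

KL-divergence calls both comparisons infinitely bad, since it has no idea that 1 is closer to 0 than 4 is. The transport distance reports 1 and 4, because it uses the underlying notion of distance.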

1: http://marcocuturi.net/Papers/cuturi13sinkhorn.pdf
False says:

January 14, 2019 at 6:32 pm

Can’t believe I get to be the first to point out that this is just the Hegelian Dialectic!
