The Control Group Is Out Of Control

I.

Allan Crossman calls parapsychology the control group for science.

That is, in let’s say a drug testing experiment, you give some people the drug and they recover. That doesn’t tell you much until you give some other people a placebo drug you know doesn’t work – but which they themselves believe in – and see how many of them recover. That number tells you how many people will recover whether the drug works or not. Unless people on your real drug do significantly better than people on the placebo drug, you haven’t found anything.

On the meta-level, you’re studying some phenomenon and you get some positive findings. That doesn’t tell you much until you take some other researchers who are studying a phenomenon you know doesn’t exist – but which they themselves believe in – and see how many of them get positive findings. That number tells you how many studies will discover positive results whether the phenomenon is real or not. Unless studies of the real phenomenon do significantly better than studies of the placebo phenomenon, you haven’t found anything.

Trying to set up placebo science would be a logistical nightmare. You’d have to find a phenomenon that definitely doesn’t exist, somehow convince a whole community of scientists across the world that it does, and fund them to study it for a couple of decades without them figuring it out.

Luckily we have a natural experiment in terms of parapsychology – the study of psychic phenomena – which most reasonable people believe don’t exist, but which a community of practicing scientists believes in and publishes papers on all the time.

The results are pretty dismal. Parapsychologists are able to produce experimental evidence for psychic phenomena about as easily as normal scientists are able to produce such evidence for normal, non-psychic phenomena. This suggests the existence of a very large “placebo effect” in science – ie with enough energy focused on a subject, you can always produce “experimental evidence” for it that meets the usual scientific standards. As Eliezer Yudkowsky puts it:

Parapsychologists are constantly protesting that they are playing by all the standard scientific rules, and yet their results are being ignored – that they are unfairly being held to higher standards than everyone else. I’m willing to believe that. It just means that the standard statistical methods of science are so weak and flawed as to permit a field of study to sustain itself in the complete absence of any subject matter.

These sorts of thoughts have become more common lately in different fields. Psychologists admit to a crisis of replication as some of their most interesting findings turn out to be spurious. And in medicine, John Ioannidis and others have been criticizing the research for a decade now and telling everyone they need to up their standards.

“Up your standards” has been a complicated demand that cashes out in a lot of technical ways. But there is broad agreement among the most intelligent voices I read (1, 2, 3, 4, 5) about a couple of promising directions we could go:

1. Demand very large sample size.

2. Demand replication, preferably exact replication, most preferably multiple exact replications.

3. Trust systematic reviews and meta-analyses rather than individual studies. Meta-analyses must prove homogeneity of the studies they analyze.

4. Use Bayesian rather than frequentist analysis, or even combine both techniques.

5. Stricter p-value criteria. It is far too easy to massage p-values to get less than 0.05. Also, make meta-analyses look for “p-hacking” by examining the distribution of p-values in the included studies (a rough sketch of this kind of check appears just after this list).

6. Require pre-registration of trials.

7. Address publication bias by searching for unpublished trials, displaying funnel plots, and using statistics like “fail-safe N” to investigate the possibility of suppressed research.

8. Do heterogeneity analyses or at least observe and account for differences in the studies you analyze.

9. Demand randomized controlled trials. None of this “correlated even after we adjust for confounders” BS.

10. Stricter effect size criteria. It’s easy to get small effect sizes in anything.
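
To make commandment [5] concrete, here is a minimal sketch of a crude p-curve check, using made-up p-values: if the studies are tracking a real effect, their significant p-values should pile up near zero (right-skewed); if they are p-hacked noise, the p-values tend to cluster just under .05. Real p-curve analysis is considerably more sophisticated than this simple binomial comparison, so treat this as an illustration of the idea rather than a ready-made tool.

    # Crude p-curve check: among significant results (p < .05), are very small
    # p-values (< .025) more common than ones just under the threshold (.025-.05)?
    # A real effect predicts yes; p-hacking of a null effect predicts no.
    from math import comb

    reported_p_values = [0.003, 0.012, 0.031, 0.044, 0.048, 0.020, 0.041]  # invented numbers

    significant = [p for p in reported_p_values if p < 0.05]
    small = sum(1 for p in significant if p < 0.025)   # the "right-skew" bin
    n = len(significant)

    # One-sided binomial test: if the studies carried no real evidential value,
    # each significant p-value would be roughly equally likely to land in either bin.
    p_binomial = sum(comb(n, k) * 0.5 ** n for k in range(small, n + 1))

    print(f"{small} of {n} significant p-values are below .025 "
          f"(binomial p = {p_binomial:.3f} against 'no right-skew')")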

If we follow these ten commandments, then we can avoid the problems that allowed parapsychology (and probably a whole host of other things we don’t know about) to sneak past the scientific gatekeepers.

Well, what now, motherfuckers?

II.

Bem, Tressoldi, Rabeyron, and Duggan (2014), full text available for download at the top bar of the link above, is parapsychology’s way of saying “thanks but no thanks” to the idea of a more rigorous scientific paradigm making them quietly wither away.

You might remember Bem as the prestigious establishment psychologist who decided to try his hand at parapsychology and to his and everyone else’s surprise got positive results. Everyone had a lot of criticisms, some of which were very very good, and the study failed replication several times. Case closed, right?

Earlier this month Bem came back with a meta-analysis of ninety replications from tens of thousands of participants in thirty-three laboratories in fourteen countries confirming his original finding, p < 1.2 * 10^-10, Bayes factor 7.4 * 10^9, funnel plot beautifully symmetrical, p-hacking curve nice and right-skewed, Orwin fail-safe n of 559, et cetera, et cetera, et cetera.

By my count, Bem follows all of the commandments except [6] and [10]. He apologizes for not using pre-registration, but says it’s okay because the studies were exact replications of a previous study, which makes it impossible for an unsavory researcher to change the parameters halfway through and so accomplishes pretty much the same thing. And he apologizes for the small effect size but points out that some effect sizes are legitimately very small, that this is no smaller than a lot of other commonly-accepted results, and that a sufficiently low p-value ought to make up for a small effect size.

This is far better than the average meta-analysis. Bem has always been pretty careful and this is no exception. Yet its conclusion is that psychic powers exist.

So – once again – what now, motherfuckers?

III.

In retrospect, that list of ways to fix science above was a little optimistic.

The first eight items (large sample sizes, replications, low p-values, Bayesian statistics, meta-analysis, pre-registration, publication bias, heterogeneity) all try to solve the same problem: accidentally mistaking noise in the data for a signal.

We’ve placed so much emphasis on not mistaking noise for signal that when someone like Bem hands us a beautiful, perfectly clear signal on a silver platter, it briefly stuns us. “Wow, of the three hundred different terrible ways to mistake noise for signal, Bem has proven beyond a shadow of a doubt he hasn’t done any of them.” And we get so stunned we’re likely to forget that this is only part of the battle.

Bem definitely picked up a signal. The only question is whether it’s a signal of psi, or a signal of poor experimental technique.

None of these commandments even touch poor experimental technique – or confounding, or whatever you want to call it. If an experiment is confounded, if it produces a strong signal even when its experimental hypothesis is false, then using a larger sample size will just make that signal even stronger.

Replicating it will just reproduce the confounded results again.

Low p-values will be easy to get if you perform the confounded experiment on a large enough scale.

Meta-analyses of confounded studies will obey the immortal law of “garbage in, garbage out”.

Pre-registration only assures that your study will not get any worse than it was the first time you thought of it, which may be very bad indeed.

Searching for publication bias only means you will get all of the confounded studies, instead of just some of them.

Heterogeneity just tells you whether all of the studies were confounded about the same amount.
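
To make the confounding point concrete, here is a toy simulation (all numbers invented): the true effect is zero, but a small systematic bias of a tenth of a standard deviation leaks into the experimental condition, standing in for whatever flaw confounds the design. Collecting more subjects, or replicating the same design, does not dilute the bias; it just buys ever-smaller p-values for it.

    # Toy simulation of a confounded experiment: no real effect, but a fixed
    # bias of 0.1 SD contaminates the treatment group. Larger samples make the
    # spurious signal look more, not less, convincing.
    import random
    from statistics import NormalDist, mean, stdev

    random.seed(0)
    BIAS = 0.1  # hypothetical systematic bias (experimenter cueing, sensory leakage, etc.)

    def confounded_experiment(n_per_group):
        control = [random.gauss(0, 1) for _ in range(n_per_group)]
        treated = [random.gauss(BIAS, 1) for _ in range(n_per_group)]  # bias, not psi
        diff = mean(treated) - mean(control)
        se = (stdev(treated) ** 2 / n_per_group + stdev(control) ** 2 / n_per_group) ** 0.5
        z = diff / se
        p = 2 * (1 - NormalDist().cdf(abs(z)))  # normal approximation
        return diff, p

    for n in (50, 500, 5000, 50000):
        diff, p = confounded_experiment(n)
        print(f"n per group = {n:>6}: observed 'effect' = {diff:+.3f}, p = {p:.2e}")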

Bayesian statistics, alone among these first eight, ought to be able to help with this problem. After all, a good Bayesian should be able to say “Well, I got some impressive results, but my prior for psi is very low, so this raises my belief in psi slightly, but raises my belief that the experiments were confounded a lot.”

Unfortunately, good Bayesians are hard to come by, and the researchers here seem to be making some serious mistakes. Here’s Bem:

An opportunity to calculate an approximate answer to this question emerges from a Bayesian critique of Bem’s (2011) experiments by Wagenmakers, Wetzels, Borsboom, & van der Maas (2011). Although Wagenmakers et al. did not explicitly claim psi to be impossible, they came very close by setting their prior odds at 10^20 against the psi hypothesis. The Bayes Factor for our full database is approximately 10^9 in favor of the psi hypothesis (Table 1), which implies that our meta-analysis should lower their posterior odds against the psi hypothesis to 10^11

Let me shame both participants in this debate.

Bem, you are abusing Bayes factor. If Wagenmakers uses your 10^9 Bayes factor to adjust from his prior odds of 10^-20 to 10^-11, then what happens the next time you come up with another database of studies supporting your hypothesis? We all know you will, because you’ve amply proven these results weren’t due to chance, so whatever factor produced these results – whether real psi or poor experimental technique – will no doubt keep producing them for the next hundred replication attempts. When those come in, does Wagenmakers have to adjust his odds from 10^-11 to 10^-2? When you get another hundred studies, does he have to go from 10^-2 to 10^7? If so, then by conservation of expected evidence he should just update to 10^+7 right now – or really to infinity, since you can keep coming up with more studies till the cows come home. But in fact he shouldn’t do that, because at some point his thought process becomes “Okay, I already know that studies of this quality can consistently produce positive findings, so either psi is real or studies of this quality aren’t good enough to disprove it”. This point should probably happen well before he increases his odds by a factor of 10^9. See Confidence Levels Inside And Outside An Argument for this argument made in greater detail.
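
To put rough numbers on both halves of that argument: the mechanical update really is just posterior odds = prior odds times Bayes factor, which is where the 10^-11 comes from. The outside-view fix is to put “these studies are confounded” on the table as its own hypothesis. The priors and likelihoods below are invented purely to show the structure (they are nobody’s actual numbers): almost all of the 10^9 gets absorbed by the mundane hypothesis, and belief in psi barely moves.

    # The naive two-hypothesis update Bem is asking Wagenmakers to make:
    prior_odds_psi = 1e-20   # Wagenmakers et al.'s stated prior odds
    bayes_factor = 1e9       # Bem's reported Bayes factor for the full database
    print(f"naive posterior odds for psi: {prior_odds_psi * bayes_factor:.0e}")  # ~1e-11

    # The same evidence with a "confounded studies" hypothesis included.
    # Priors and likelihoods are made-up illustrations, not real estimates.
    priors = {
        "psi is real":                 1e-20,
        "studies confounded somehow":  0.3,    # not at all unlikely a priori
        "clean studies, null is true": 0.7,
    }
    # How strongly each hypothesis predicts "a big positive meta-analysis":
    likelihoods = {
        "psi is real":                 1.0,
        "studies confounded somehow":  1.0,    # flawed methods reliably produce this
        "clean studies, null is true": 1e-9,   # roughly 1 / Bayes factor
    }

    evidence = sum(priors[h] * likelihoods[h] for h in priors)
    for h in priors:
        print(f"P({h} | evidence) = {priors[h] * likelihoods[h] / evidence:.2e}")
    # Psi moves from 1e-20 to about 3e-20; "confounded somehow" takes nearly
    # all of the posterior probability.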

Wagenmakers, you are overconfident. Suppose God came down from Heaven and said in a booming voice “EVERY SINGLE STUDY IN THIS META-ANALYSIS WAS CONDUCTED PERFECTLY WITHOUT FLAWS OR BIAS, AS WAS THE META-ANALYSIS ITSELF.” You would see a p-value of less than 1.2 * 10^-10 and think “I bet that was just coincidence”? And then they could do another study of the same size, also God-certified, returning exactly the same results, and you would say “I bet that was just coincidence too”? YOU ARE NOT THAT CERTAIN OF ANYTHING. Seriously, read the @#!$ing Sequences.

Bayesian statistics, at least the way they are done here, aren’t going to be of much use to anybody.

That leaves randomized controlled trials and effect sizes.

Randomized controlled trials are great. They eliminate most possible confounders in one fell swoop, and are excellent at keeping experimenters honest. Unfortunately, most of the studies in the Bem meta-analysis were already randomized controlled trials.

High effect sizes are really the only thing the Bem study lacks. And it is very hard to have experimental technique so bad that it consistently produces a result with a high effect size.

But as Bem points out, demanding high effect size limits our ability to detect real but low-effect phenomena. Just to give an example, many physics experiments – like the ones that detected the Higgs boson or neutrinos – rely on detecting extremely small perturbations in the natural order, over millions of different trials. Less esoterically, Bem mentions the example of aspirin decreasing heart attack risk, which it definitely does and which is very important, but which has an effect size lower than that of his psi results. If humans have some kind of very weak psionic faculty that under regular conditions operates poorly and inconsistently, but does indeed exist, then excluding it by definition from the realm of things science can discover would be a bad idea.
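
To see what “real but low-effect” costs in practice, here is the standard normal-approximation formula for the sample size a two-group comparison needs; d = 0.1 is simply an assumption chosen to be in the same small-effect territory as the examples above, not Bem’s exact figure.

    # Approximate subjects needed per group for a two-sample comparison,
    # using the usual normal approximation:
    #   n ~= 2 * ((z_{1 - alpha/2} + z_{power}) / d)^2
    from statistics import NormalDist

    def n_per_group(d, alpha=0.05, power=0.80):
        z = NormalDist().inv_cdf
        return 2 * ((z(1 - alpha / 2) + z(power)) / d) ** 2

    for d in (0.8, 0.5, 0.2, 0.1):
        print(f"d = {d:>3}: roughly {n_per_group(d):,.0f} subjects per group")
    # Small effects are detectable, but only by paying for them in sample size.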

All of these techniques are about reducing the chance of confusing noise for signal. But when we think of them as the be-all and end-all of scientific legitimacy, we end up in awkward situations where they come out super-confident in a study’s accuracy simply because its flaw was one they weren’t geared up to detect. Because a lot of the time the problem is something more than just noise.

IV.

Wiseman & Schlitz’s Experimenter Effects And The Remote Detection Of Staring is my favorite parapsychology paper ever and sends me into fits of nervous laughter every time I read it.

The backstory: there is a classic parapsychological experiment where a subject is placed in a room alone, hooked up to a video link. At random times, an experimenter stares at them menacingly through the video link. The hypothesis is that this causes their galvanic skin response (a physiological measure of subconscious anxiety) to increase, even though there is no non-psychic way the subject could know whether the experimenter was staring or not.

Schlitz is a psi believer whose staring experiments had consistently supported the presence of a psychic phenomenon. Wiseman, in accordance with nominative determinism, is a psi skeptic whose staring experiments keep showing nothing and disproving psi. Since they were apparently the only two people in all of parapsychology with a smidgen of curiosity or rationalist virtue, they decided to team up and figure out why they kept getting such different results.

The idea was to plan an experiment together, with both of them agreeing on every single tiny detail. They would then go to a laboratory and set it up, again both keeping close eyes on one another. Finally, they would conduct the experiment in a series of different batches. Half the batches (randomly assigned) would be conducted by Dr. Schlitz, the other half by Dr. Wiseman. Because the two authors had very carefully standardized the setting, apparatus and procedure beforehand, “conducted by” pretty much just meant greeting the participants, giving the experimental instructions, and doing the staring.

The results? Schlitz’s trials found strong evidence of psychic powers, Wiseman’s trials found no evidence whatsoever.

Take a second to reflect on how this makes no sense. Two experimenters in the same laboratory, using the same apparatus, having no contact with the subjects except to introduce themselves and flip a few switches – and whether one or the other was there that day completely altered the result. For a good time, watch the gymnastics they have to do in the paper to make this sound sufficiently sensical to even get published. This is the only journal article I’ve ever read where, in the part of the Discussion section where you’re supposed to propose possible reasons for your findings, both authors suggest maybe their co-author hacked into the computer and altered the results.

While it’s nice to see people exploring Bem’s findings further, this is the experiment people should be replicating ninety times. I expect something would turn up.

As it is, Kennedy and Taddonio list ten similar studies with similar results. One cannot help wondering about publication bias (if the skeptic and the believer got similar results, who cares?). But the phenomenon is sufficiently well known in parapsychology that it has led to its own host of theories about how skeptics emit negative auras, or the enthusiasm of a proponent is a necessary kindling for psychic powers.

Other fields don’t have this excuse. In psychotherapy, for example, practically the only consistent finding is that whatever kind of psychotherapy the person running the study likes is most effective. Thirty different meta-analyses on the subject have confirmed this with strong effect size (d = 0.54) and good significance (p = .001).
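
For a rough sense of what a d of 0.54 means, here is the standard “common language” translation (assuming two normal distributions with equal variance, which is an idealization): the probability that a randomly drawn outcome from the higher-scoring group beats a randomly drawn outcome from the other group is Phi(d / sqrt(2)). This is just an interpretation aid, not anything reported by those meta-analyses.

    # Common-language effect size for Cohen's d: probability that a random
    # observation from the favored group exceeds one from the comparison group,
    # assuming two equal-variance normal distributions.
    from statistics import NormalDist

    def prob_superiority(d):
        return NormalDist().cdf(d / 2 ** 0.5)

    print(f"d = 0.54  ->  P(superiority) ~= {prob_superiority(0.54):.2f}")  # about 0.65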

Then there’s Munder (2013), which is a meta-meta-analysis on whether meta-analyses of confounding by researcher allegiance effect were themselves meta-confounded by meta-researcher allegiance effect. He found that indeed, meta-researchers who believed in researcher allegiance effect were more likely to turn up positive results in their studies of researcher allegiance effect (p < .002).

It gets worse. There’s a famous story about an experiment where a scientist told teachers that his advanced psychometric methods had predicted a couple of kids in their class were about to become geniuses (the students were actually chosen at random). He followed the students for the year and found that their intelligence actually increased. This was supposed to be a Cautionary Tale About How Teachers’ Preconceptions Can Affect Children.

Less famous is that the same guy did the same thing with rats. He sent one laboratory a box of rats saying they were specially bred to be ultra-intelligent, and another lab a box of (identical) rats saying they were specially bred to be slow and dumb. Then he had them do standard rat learning tasks, and sure enough the first lab found very impressive results, the second lab very disappointing ones.

This scientist – let’s give his name, Robert Rosenthal – then investigated three hundred forty five different studies for evidence of the same phenomenon. He found effect sizes of anywhere from 0.15 to 1.7, depending on the type of experiment involved. Note that this could also be phrased as “between twice as strong and twenty times as strong as Bem’s psi effect”. Mysteriously, animal learning experiments displayed the highest effect size, supporting the folk belief that animals are hypersensitive to subtle emotional cues.

Okay, fine. Subtle emotional cues. That’s way more scientific than saying “negative auras”. But the question remains – what went wrong for Schlitz and Wiseman? Even if Schlitz had done everything short of saying “The hypothesis of this experiment is for your skin response to increase when you are being stared at, please increase your skin response at that time,” and subjects had tried to comply, the whole point was that they didn’t know when they were being stared at, because to find that out you’d have to be psychic. And how are these rats figuring out what the experimenters’ subtle emotional cues mean anyway? I can’t figure out people’s subtle emotional cues half the time!

I know that standard practice here is to tell the story of Clever Hans and then say That Is Why We Do Double-Blind Studies. But first of all, I’m pretty sure no one does double-blind studies with rats. Second of all, I think most social psych studies aren’t double blind – I just checked the first one I thought of, Aronson and Steele on stereotype threat, and it certainly wasn’t. Third of all, this effect seems to be just as common in cases where it’s hard to imagine how the researchers’ subtle emotional cues could make a difference. Like Schlitz and Wiseman. Or like the psychotherapy experiments, where most of the subjects were doing therapy with individual psychologists and never even saw whatever prestigious professor was running the study behind the scenes.

I think it’s a combination of subconscious emotional cues, subconscious statistical trickery, perfectly conscious fraud which for all we know happens much more often than detected, and things we haven’t discovered yet which are at least as weird as subconscious emotional cues. But rather than speculate, I prefer to take it as a brute fact. Studies are going to be confounded by the allegiance of the researcher. When researchers who don’t believe something discover it, that’s when it’s worth looking into.

V.

So what exactly happened to Bem?

Although Bem looked hard for unpublished material, I don’t know if he succeeded. Unpublished material, in this context, has to mean “material published enough for Bem to find it”, which in this case was mostly things presented at conferences. What about results so boring that they were never even mentioned?

And I predict people who believe in parapsychology are more likely to conduct parapsychology experiments than skeptics. Suppose this is true. And further suppose that for some reason, experimenter effect is real and powerful. That means most of the experiments conducted will support Bem’s result. But this is still a weird form of “publication bias” insofar as it ignores the contrary results of hypothetical experiments that were never conducted.

And worst of all, maybe Bem really did do an excellent job of finding every little two-bit experiment that no journal would take. How much can we trust these non-peer-reviewed procedures?

I looked through his list of ninety studies for all the ones that were both exact replications and had been peer-reviewed (with one caveat to be mentioned later). I found only seven:

Batthyany, Kranz, and Erber: 0.268
Ritchie 1: 0.015
Ritchie 2: -0.219
Ritchie 3: -0.040
Subbotsky 1: 0.279
Subbotsky 2: 0.292
Subbotsky 3: -0.399

Three find large positive effects, two find approximately zero effects, and two find large negative effects. Without doing any calculatin’, this seems pretty darned close to chance to me.
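
For anyone who does want to do the calculatin’, here is a back-of-the-envelope version. It treats the seven estimates as equally weighted, which ignores the studies’ sample sizes (a real meta-analysis would weight by precision), so take it only as a sanity check.

    # Unweighted summary of the seven peer-reviewed exact-replication effect sizes.
    from statistics import mean, stdev

    effects = [0.268, 0.015, -0.219, -0.040, 0.279, 0.292, -0.399]

    m = mean(effects)                       # comes out near zero (about 0.03)
    se = stdev(effects) / len(effects) ** 0.5
    t = m / se                              # well under 1

    print(f"mean effect = {m:+.3f}, SE = {se:.3f}, t = {t:.2f} on 6 df")
    # For comparison, |t| would need to exceed roughly 2.45 for p < .05
    # (two-sided, 6 df). This looks like noise scattered around zero.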

Okay, back to that caveat about replications. One of Bem’s strongest points was how many of the studies included were exact replications of his work. This is important because if you do your own novel experiment, it leaves a lot of wiggle room to keep changing the parameters and statistics a bunch of times until you get the effect you want. This is why lots of people want experiments to be preregistered with specific commitments about what you’re going to test and how you’re going to do it. These experiments weren’t preregistered, but conforming to a previously done experiment is a pretty good alternative.

Except that I think the criteria for “replication” here were exceptionally loose. For example, Savva et al was listed as an “exact replication” of Bem, but it was performed in 2004 – seven years before Bem’s original study took place. I know Bem believes in precognition, but that’s going too far. As far as I can tell “exact replication” here means “kinda similar psionic-y thing”. Also, Bem classily lists his own experiments as exact replications of themselves, which gives a big boost to the “exact replications return the same results as Bem’s original studies” line. I would want to see much stricter criteria for replication before I relax the “preregister your trials” requirement.

(Richard Wiseman – the same guy who provided the negative aura for the Wiseman and Schlitz experiment – has started a pre-register site for Bem replications. He says he has received five of them. This is very promising. There is also a separate pre-register for parapsychology trials in general. I am both extremely pleased at this victory for good science, and ashamed that my own field is apparently behind parapsychology in the “scientific rigor” department.)

That is my best guess at what happened here – a bunch of poor-quality, peer-unreviewed studies that weren’t as exact replications as we would like to believe, all subject to mysterious experimenter effects.

This is not a criticism of Bem or a criticism of parapsychology. It’s something that is inherent to the practice of meta-analysis, and even more, inherent to the practice of science. Other than a few very exceptional large medical trials, there is not a study in the world that would survive the level of criticism I am throwing at Bem right now.

I think Bem is wrong. But the level of criticism it would take to prove a wrong study wrong is higher than almost any existing study can withstand. That is not encouraging for existing studies.

VI.

The motto of the Royal Society – Hooke, Boyle, Newton, some of the people who arguably invented modern science – was nullius in verba, “take no one’s word”.

This was a proper battle cry for seventeenth century scientists. Think about the (admittedly kind of mythologized) history of Science. The scholastics saying that matter was this, or that, and justifying themselves by long treatises about how A, B, C, the word of the Bible, Aristotle, self-evident first principles, and the Great Chain of Being all clearly proved their point. Then other scholastics would write different long treatises on how D, E, and F, Plato, St. Augustine, and the proper ordering of angels all indicated that clearly matter was something different. Both groups were pretty sure that the other had made a subtle error of reasoning somewhere, and both groups were perfectly happy to spend centuries debating exactly which one of them it was.

And then Galileo said “Wait a second, instead of debating exactly how objects fall, let’s just drop objects off of something really tall and see what happens”, and after that, Science.

Yes, it’s kind of mythologized. But like all myths, it contains a core of truth. People are terrible. If you let people debate things, they will do it forever, come up with horrible ideas, get them entrenched, play politics with them, and finally reach the point where they’re coming up with theories why people who disagree with them are probably secretly in the pay of the Devil.

Imagine having to conduct the global warming debate, except that you couldn’t appeal to scientific consensus and statistics because scientific consensus and statistics hadn’t been invented yet. In a world without science, everything would be like that.

Heck, just look at philosophy.

This is the principle behind the Pyramid of Scientific Evidence. The lowest level is your personal opinions, no matter how ironclad you think the logic behind them is. Just above that is expert opinion, because no matter how expert someone is they’re still only human. Above that is anecdotal evidence and case studies, because even though you’re finally getting out of people’s heads, it’s still possible for the content of people’s heads to influence which cases they pay attention to. At each level, we distill away more and more of the human element, until presumably at the top the dross of humanity has been purged away entirely and we end up with pure unadulterated reality.

The Pyramid of Scientific Evidence

And for a while this went well. People would drop things off towers, or see how quickly gases expanded, or observe chimpanzees, or whatever.

Then things started getting more complicated. People started investigating more subtle effects, or effects that shifted with the observer. The scientific community became bigger, everyone didn’t know everyone anymore, you needed more journals to find out what other people had done. Statistics became more complicated, allowing the study of noisier data but also bringing more peril. And a lot of science done by smart and honest people ended up being wrong, and we needed to figure out exactly which science that was.

And the result is a lot of essays like this one, where people who think they’re smart take one side of a scientific “controversy” and say which studies you should believe. And then other people take the other side and tell you why you should believe different studies than the first person thought you should believe. And there is much argument and many insults and citing of authorities and interminable debate for, if not centuries, at least a pretty long time.

The highest level of the Pyramid of Scientific Evidence is meta-analysis. But a lot of meta-analyses are crap. This meta-analysis got p < 1.2 * 10^-10 for a conclusion I'm pretty sure is false, and it isn’t even one of the crap ones. Crap meta-analyses look more like this, or even worse.

How do I know it’s crap? Well, I use my personal judgment. How do I know my personal judgment is right? Well, a smart well-credentialed person like James Coyne agrees with me. How do I know James Coyne is smart? I can think of lots of cases where he’s been right before. How do I know those count? Well, John Ioannidis has published a lot of studies analyzing the problems with science, and confirmed that cases like the ones Coyne talks about are pretty common. Why can I believe Ioannidis’ studies? Well, there have been good meta-analyses of them. But how do I know if those meta-analyses are crap or not? Well…

The Ouroboros of Scientific Evidence

Science! YOU WERE THE CHOSEN ONE! It was said that you would destroy reliance on biased experts, not join them! Bring balance to epistemology, not leave it in darkness!

I LOVED YOU!!!!

Edit: Conspiracy theory by Andrew Gelman


197 Responses to The Control Group Is Out Of Control

  1. suntzuanime says:

    Imagine the global warming debate, but you couldn’t appeal to scientific consensus or statistics because you didn’t really understand the science or the statistics, and you just had to take some people who claimed to know what was going on at their verba.

    “Take no one’s word” sounds like a good rallying cry when it comes to dropping a bowling ball and a feather and seeing which hits the ground first, but I don’t have my own global temperature monitoring stations that I’ve been running for the past fifty years, and even if I did I probably wouldn’t be smart enough to know if a climate model based on them was bullshit or not.

    I guess this is sort of the point you were making? But it’s weird to cite climate change as a counterexample.

    Now I’m wondering if you were just doing the thing where you subtly undercut your own points as you make them out of sheer uncontrollable perversity. This edit button is a curse, not a blessing.

    • Oligopsony says:

      If some of the weirder psi suppression theories are right, psi should actually be easier to study by conducting personal experiments than by trying to study or do public science, especially if you precommit yourself to not telling anyone about the results.

    • Douglas Knight says:

      A 20 year project for that purpose? Why doesn’t he seek publicity more often? Did you try the conspiracy theory link at the end?

    • Deiseach says:

      Except you have to wait until you get to the moon to drop your bowling ball and feather and have the theory proved right about “they both fall at the same rate because they are both acted upon by the same force”.

      Or the counter-factual “Okay, you see those two lights in the sky? The big one in the day that goes from east to west and looks like it’s moving around us while we’re staying still? And the small one in the night that moves from east to west and looks like it’s moving around us while we’re staying still? Yeah, well, you can believe the evidence of your senses about the small one but the big one is actually staying still and we’re moving. What do you mean, ‘evidence’? This is SCIENCE!!!”

      Scott’s little slap at the Scholastics is all well and good, but even science has to rely on *spit* philosophy when it’s debating things for which it has not yet got the physical evidence, especially when it won’t be able to back it up with physical evidence for a couple of centuries.

      • suntzuanime says:

        Except you have to wait until you get to the moon to drop your bowling ball and feather and have the theory proved right about “they both fall at the same rate because they are both acted upon by the same force”.

        I was just doing the thing where I subtly undercut my own points as I make them out of sheer uncontrollable perversity. 😀

    • Gates VP says:

      Personally, I think the whole global warming thing is an even bigger mess than you’ve described 🙂

      Underpinning “global warming” is “man-made climate change”, which seems like a silly smoke-screen debate. In fact, even the premise of “global warming” is a really silly debate.

      There’s a 100% chance that humans are having a dramatic influence on the earth’s climate. And I really mean that 100%. We are clearly performing dramatic modifications to our climate and we’re clearly doing so at a faster rate than previous species.

      So back to “global warming”… what if it’s really “global cooling”? Does it actually matter? I mean, what if it was “global cooling”, would we start burning down forests en masse just to “fix” the problem?

      All of this debate back and forth is just ignoring the real problem: the planet is changing, how do we want to adapt, how do we want to mold it?

      I mean, we want to slow CO2 emissions because they also contain lots of things that are unhealthy for us to breathe. Does it really matter if that cools or warms the planet?

      • anon says:

        Your comment is one of the worst I’ve ever seen on this website. It matters whether or not CO2 warms or cools the planet.

        First, we would need to pursue different policies in response to each possibility because they involve different scenarios for danger. Global warming might be stoppable by throwing giant ice cubes into the ocean, but if there’s global cooling that’s the last thing we should do. Second, they would have different consequences. Maybe warming would kill one billion people while cooling would only kill one million.

        If CO2 is causing cooling and cooling is bad, we should want more trees and not fewer. Cooling is the opposite of warming but that doesn’t imply that the opposite of one problem’s solution will solve the other.

        Policies aren’t determined in a vacuum. If the only harmful consequence of CO2 is its effects on our lungs, it might not be a good idea to switch to other types of energy because the downsides might outweigh the upsides.

        • Gates VP says:

          If CO2 is causing cooling and cooling is bad…

          This is the core issue. We have no coherent strategy for calling the change “good” or “bad”.

          Global warming might be stoppable by throwing giant ice cubes into the ocean,…

          There’s also the assumption here that we could come up with a plan to fix something that we’re not really sure is broken.

          Is the goal here really to keep the earth at a static temperature for the remainder of human existence?

      • MugaSofer says:

        “There’s a 100% chance that humans are having a dramatic influence on the earth’s climate. And I really mean that 100%.”

        No, you really don’t.

        http://www.lesswrong.com/lw/mp/0_and_1_are_not_probabilities/

        • Gates VP says:

          Fine, this is a heavy statistics blog.

          The only way we can argue this would require a lot of semantics about the definition of the words “climate change”.

          We dropped a pair of nuclear bombs in Japan several decades ago. If the effects of those bombs do not count as “climate change” for that region, then we need to be very specific about how we’re going to define “climate change” and attempt to measure the “man-made” effects thereof.

      • Gates, you said,

        We are clearly performing dramatic modifications to our climate and…doing so at a faster rate than previous species. So back to “global warming”…what if it’s really “global cooling”? Does it actually matter?

        No, it doesn’t matter. Climate is a complex dynamic system. At the very least, it makes sense to conserve non-renewable resources because we’ve nearly depleted them. Agreement and action on some basics like that would help. So would some honesty about carbon credit schemes (neoliberal economics is too easy to game) and boondoggle solar tax credits/government funds that corrupt Green types have used for personal enrichment, repeatedly.

        Your comment didn’t deserve to be called “one of the worst ever seen on this website”! This is NOT a heavy statistics website. Arguing over your rhetorical use of Prob(x) = 1.0 is petty. Ignore your detractors here. You are correct; they are more wrong.

  2. jaimeastorga2000 says:

    The “Experimenter Effects And The Remote Detection Of Staring” link is broken.

  3. Ken Arromdee says:

    How do you distinguish
    1) Psychic researchers get as many good results as normal researchers because both sets of researchers are equally sloppy, and
    2) Psychic researchers get as many good results as normal researchers because psychic researchers are worse at research than normal researchers (raising the level of positive results) but this is compensated for by the fact that psychic powers are not real (reducing the number of positive results)?

    In other words, you’re basing this on the premise that psychic researchers are exactly the same as regular researchers except that they’re researching something that’s not going to pan out. I see little reason to believe this premise. For instance, I would not be very surprised if gullibility or carelessness is correlated with belief in psychic powers, and willingness to do psychic experiments is also correlated with belief in psychic powers.

    • Scott Alexander says:

      I doubt psychic researchers are just as good as normal researchers (though a few are) and I agree that if I meant “control group” literally, in terms of trying to find the quantitative effect size of science, this methodology wouldn’t be good enough.

      I’m using control group more as a metaphor, where the mistakes of the best parapsychologists can be used as a pointer to figure out what other scientists have to improve.

      • Ken Arromdee says:

        But once you concede that psychic researchers aren’t really like ordinary researchers, you have little reason to believe that psychic researchers will make the same sorts of mistakes that ordinary researchers do. Even if you find single examples of both types making the same mistake, you have no reason to believe that the distribution is the same among both groups. It could be that a mistake that is common among psychic researchers is rare among normal ones, and focusing on it is misplaced.

        I’m also generally skeptical about using X as a metaphor when X is not actually true.

        • Johann says:

          With respect, I provided in my atrociously long reply a series of arguments, with evidence, that parapsychologists are at least as good as mainstream researchers in most respects, and significantly better in others. Skeptics like Chris French concur with me; see his recent talk (https://www.youtube.com/watch?v=ObXWLF6acuw) for evidence of this.

          Mousseau (2003), for example, took an empirical approach to this, and compared research in top parapsychology journals like the Journal of Scientific Exploration and the Journal of Parapsychology with mainstream journals such as the Journal of Molecular and Optical Physics, the British Journal of Psychology, etc, finding that, most of the time, fringe research displayed a higher level of conformance to several basic criteria of good science. This includes the reporting of negative results, usage of statistics, rejection of anecdotal evidence, self-correction, overlap with other fields of research, and abstinence from jargon. While I’m aware most of these don’t directly impact quality of experimentation, they do provide respectable evidence that parapsychologists are actually about as careful, in their scientific thinking, as most anyone else.

          Moreover, research trying to establish a link between belief in psi phenomena and measures like IQ and credulity has been for the most part unsuccessful, finding that belief in psi does not vary according to level of education (although belief in superstitions like the power of 13 does).

          Finally, there is the fact that skeptics have directly involved themselves in the critique and even design of parapsychological studies. Ganzfeld studies after 1986, for example, owe much of their sophistication and rigor to Ray Hyman, who coauthored a report called the Joint Communique with Honorton, a parapsychologist, where a series of recommendations were specified whose implementation would be convincing, and have been widely adopted today.

          I discuss many more examples of sophistication in parapsychological research; see, again, my absurdly large post for these details.

  4. Sniffnoy says:

    Some nitpicking:

    When you talk about probabilities of 10^7 and such, obviously, this should be odds ratios. I mean, these are pretty similar when you’re talking about an odds ratio of 10^-11, not so much when it’s 10^7.

    Also, some writing nitpicking:

    By my count, Bem follows all of the commandments except [2] and [8].

    You seem to mean [6] and [10]? (Rest of paragraph similarly.)

    Other fields don’t have this excuse. In psychotherapy, for example, practically the only consistent finding is that whatever kind of psychotherapy the person running the study likes.

    You seem to have left out the verb in this sentence?

    • Val says:

      “In psychotherapy, for example, practically the only consistent finding is that whatever kind of psychotherapy the person running the study likes.”

      What they found was which kind of psychotherapy the experimenter likes. Not the best-worded the sentence could be, but the point is in there.

  5. Sniffnoy says:

    Also: Male Scent May Compromise Biomedical Research. Not actually related other than being another instance of “science is hard”, but I thought that you’d find it amusing and that it was worth pointing out.

  6. Eliezer Yudkowsky says:

    It’s possible that you and I and some of the most experienced scientists and statisticians on the planet could get together and design a procedure for “meta-analysis” which would require actual malice to get wrong. I’ll be happy to start the discussion by suggesting that step 1 is to convert all the studies into likelihood functions on the same hypothesis space, and step 2 is to realize that the combined likelihood functions rule out all of the hypothesis space, and step 3 is to suggest which clusters of the hypothesis space are well-supported by many studies and to mark which other studies must then have been completely wrong.

    Until that time, meta-analyses will go on being bullshit. They are not the highest level of the scientific pyramid. They can deliver whatever answer the analyst likes. When I read about a new meta-analysis I mostly roll my eyes. Maybe it was honest, sure, but how would I know? And why would it give the right answer even if the researchers were in fact honest? You can’t multiply a bunch of likelihood functions and get what a real Bayesian would consider zero everywhere, and from this extract a verdict by the dark magic of frequentist statistics.

    I can envision what a real, epistemologically lawful, real-world respecting, probability-theory-obeying meta-analysis would look like. I mean I couldn’t tell you how to actually set down a method that everyone could follow, I don’t have enough experience with how motivated reasoning plays out in these things and what pragmatic safeguards would be needed to stop it. But I have some idea what the purely statistical part would look like. I’ve never seen it done.

    • This is the first time I’ve ever wished for an upvote button on a WordPress blog. Everything Eliezer says here.

    • Josh H says:

      Rather than try to come up with an infallible procedure for doing valid science, it might be simpler and more productive to tweak the incentives. In other words, separate the people who perform the experiments from the people who generate the hypotheses. I just wrote up some quick thoughts on what that might look like.

      • suntzuanime says:

        The problem with this is that in non-pathological science there is a lot of interplay between experimentation and hypothesis-generation. It used to be that science was “do an experiment to figure out how the world works” rather than “decide everything in the world is fire then do an experiment to see if that’s true”. The latter is still better than deciding everything in the world is fire and not bothering with experiment, but it injects a lot of friction, especially into exploratory work.

        A slightly modified version of your proposal might separate reaching conclusions from proving them. You wouldn’t outsource your experimentation, you’d still do your own experiments. But their results would be considered preliminary, and you’d need to have your results replicated by a replication lab in order to be stamped as official science and published in the serious portions of serious journals.

        • Josh H says:

          Yeah, I think that’s a much better way of putting it. Discovery could still be experimental, but things like “putting it in a peer reviewed journal” could be outsourced.

      • Incentives can only select from what people can figure out how to do.

        http://www.youtube.com/watch?v=O4f4rX0XEBA

        If you don’t want to watch the whole thing, you could start at about 3:18.

        • Josh H says:

          I agree that changing incentives can’t make people start doing something they don’t know how to do.

          People do know how to do experiments to disprove a hypothesis, though. What they don’t know how to do is systemically prevent experimenter bias from systemically warping the design of and statistical interpretation of such experiments, leading to continual production of false positives.

          If we can set up incentives such that the experimenter’s bias is orthogonal to the hypothesis being true or false, we would still expect some false negatives and false positives, since science is hard, but we’d expect them to statistically average out over time instead of accumulating as entire disciplines worth of non-results.

    • Kevin C says:

      step 1 is to convert all the studies into likelihood functions on the same hypothesis space

      While I agree that if the procedure you propose were possible it would be helpful, I’m skeptical that step 1 is possible* outside the hardest of sciences. Sure, in physics, math, computer science, and maybe chemistry you can define the hypothesis space clearly. However, once you go even as far as molecular biology ex vivo, the hypothesis space becomes too difficult to measure, much less convert the original English & jargon description of the hypothesis into a proper representation of the hypothesis space. (Some in vitro biology may still be measurable, but as soon as you’re dealing with even the simplest living cells, you’ve got hundreds of proteins that have to be part of the hypothesis space, even if the hypothesis on that dimension is merely “protein X” may be present but has neutral effect on the measured outcomes.)

      * That is, not possible for modern day humans. I’m agnostic on whether even a super-human AI could correctly represent the hypothesis space of a molecular biology paper. I think you get into encoding issues before you get to hypotheses that complex.

  7. Douglas Knight says:

    There is another way to do placebo science – subtly sabotage the researcher’s experiment. This is routinely done to all real science students (ie, physics majors at maybe 20 schools).

    Three find large positive effects, two find approximate zero effects, and two find large negative effects. Without doing any calculatin’, this seems pretty darned close to chance for me.

    The effects of chance assuming the null hypothesis are much more specific than the average effect being zero. If your sample sizes are adequate, there should be no large effects, by definition of “adequate.”

    ━━━━━━━━━

    An analogy occurred to me, comparing the pyramid of evidence to Lewis Thomas’s take on medicine. His “Technology of Medicine” said that real medicine requires understanding and allows cheap immediate cures. In contrast, most real-world medicine is expensive use of techniques that barely work and do so for no apparent reason. The pyramid of evidence is purely a product of medicine, attempting to evaluate treatments that have tiny effect sizes. With no understanding, the only evaluation method is large samples. But it is not merely a tool for fake medicine, it is an example of fake science.

    • Douglas Knight says:

      I corrected my citation from Lewis Thomas’s Youngest Science to an essay in his Lives of a Cell. But the specific work is unimportant because you should read everything he wrote. Not just Scott, but also you. Sadly, that is only a few hundred pages.

    • gwern says:

      There is another way to do placebo science – subtly sabotage the researcher’s experiment. This is routinely done to all real science students (ie, physics majors at maybe 20 schools).

      I don’t follow. How are physics majors’ experiments being sabotaged?

      • Douglas Knight says:

        The TA comes in at night and miscalibrates equipment. I don’t know the details. It is probably hard to cause qualitative changes, such as to move them into full placebo condition. Instead they get the wrong numeric answer or unexpectedly large error bars.

  8. Kevin says:

    Bem definitely picked up a signal. The only question is whether it’s a signal of psi, or a signal of poor experimental technique.

    But as Bem points out, demanding high effect size limits our ability to detect real but low-effect phenomena. Just to give an example, many physics experiments – like the ones that detected the Higgs boson or neutrinos – rely on detecting extremely small perturbations in the natural order, over millions of different trials.

    The point I’m about to mention, suggested by these two excerpts, is mostly covered in the Experimenter Effect section, but in a way that seems somewhat indirect to me. That point is systematic uncertainty. Particle physics experiments can confidently capture small effects because – in addition to commandments 1, 2, 4, 5, among others – we spend a great deal of time measuring biases in our detectors. Time spent assessing systematic uncertainty can easily make up the majority of a data analysis project. The failure to find (and correct or mitigate, if possible) systematic biases can give us results like faster-than-light neutrinos.

    Of course, it is much easier to give this advice than to take it and apply it to messy things like medicine or psychology. I freely admit that I would barely know where to start when it comes to such fields. Systematic uncertainty is an important topic in this type of discussion, though.

  9. Can anyone think of a remotely sensible explanation for the Wiseman and Schlitz result? Right now, “skeptics emit negative auras, or the enthusiasm of a proponent is a necessary kindling for psychic powers” is looking pretty good.

    If someone were raising money to fund a replication of this experiment, I would totally consider donating.

    • nydwracu says:

      Or psi ability is distributed unequally, and people with more of it observe/notice it firsthand and so are more likely to believe in it. Or psi doesn’t exist and the RNG has a sense of humor.

      Or:

      Most participants were run by whichever experimenter was free to carry out the session; however, on a few occasions (e.g., when a participant was a friend or colleague of one of the experimenters) the experimenter would be designated in advance of the trial. Thus most participants were assigned to experimenters in an opportunistic, rather than properly ‘randomised’ (e.g., via random number tables or the output of an RNG), way.

      Something really weird could have happened there, but I have no idea what bias could have been added by that that would produce those results.

      It’s probably the RNG’s sense of humor. But it would be interesting to see someone steelman psi.

    • Scott Alexander says:

      Yeah, I don’t know. This is one place where, contrary to the spirit of this post, I’m pretty willing to accept “they got a significant result by coincidence”. I’d also donate to a replication.

      • Shmi Nux says:

        Replicating this experiment is a wrong way to go. It is designed to detect psi, but instead uncovers a more interesting effect of participant-dependence of psi-detection, which is worth studying, by constructing a separate experiment explicitly for that purpose. Once the dependence part is figured out it makes sense to review the original protocol.

      • nydwracu says:

        If they got a significant result by coincidence, what of their earlier results? It’s possible to explain it by saying that Schlitz had consistently bad methods and they got a significant result by coincidence…

        It could also be a really weird chemical thing somehow? Like, if intending to creep someone out results in subconscious emission of chemicals that can produce the effect of feeling creeped out in someone sitting in a room #{distance} away? (I think of that because of that rat study.)

    • Deiseach says:

      Well, if the placebo effect works positively, in that you can think yourself better if given a sugar pill and told it’s a powerful new medicine, maybe there’s a negative effect as well?

      Perhaps “skeptics interfere with the vibrations” isn’t just an excuse by fraudulent mediums as to why they can’t produce effects (translation: they don’t dare try their conjuring tricks) in the presence of investigators.

      If a skeptic is running an experiment with the conscious attitude “I am doing impartial science here” but all the time in the back of his mind he’s thinking “This is hooey, I know there’s no such thing as telepathy/precognition/what have you, this is not going to work”, maybe that really does trigger some kind of observer effect?

      (I’m not even going to try and untangle Schrodinger’s cat where if you go in with a strong expectation that the cat is dead, would this skew the likelihood of the cat being dead when you open the box beyond what you’d expect from chance?)

      To be fair, I’m sceptical myself about measuring galvanic skin changes; I wouldn’t hang a rabid dog on the evidence of a “lie detector”, and I’m as unconvinced as Chesterton’s Fr. Brown in the 1914 story “The Mistake of the Machine”:

      “I’ve been reading,” said Flambeau, “of this new psychometric method they talk about so much, especially in America. You know what I mean; they put a pulsometer on a man’s wrist and judge by how his heart goes at the pronunciation of certain words. What do you think of it?”

      “I think it very interesting,” replied Father Brown; “it reminds me of that interesting idea in the Dark Ages that blood would flow from a corpse if the murderer touched it.”

      “Do you really mean,” demanded his friend, “that you think the two methods equally valuable?”

      “I think them equally valueless,” replied Brown. “Blood flows, fast or slow, in dead folk or living, for so many more million reasons than we can ever know. Blood will have to flow very funnily; blood will have to flow up the Matterhorn, before I will take it as a sign that I am to shed it.”

      “The method,” remarked the other, “has been guaranteed by some of the greatest American men of science.”

      “What sentimentalists men of science are!” exclaimed Father Brown, “and how much more sentimental must American men of science be! Who but a Yankee would think of proving anything from heart-throbs? Why, they must be as sentimental as a man who thinks a woman is in love with him if she blushes. That’s a test from the circulation of the blood, discovered by the immortal Harvey; and a jolly rotten test, too.”

      • nydwracu says:

        Doesn’t matter if galvanic skin changes aren’t related to anything — if they aren’t a sign of any change in mental state or similar, but can still be affected by psi, then there’s still something that is affected by psi.

    • moridinamael says:

      Let’s say for the sake of argument that Schlitz gave off weird, nervousness-inducing vibes. Interacting with him made their galvanic response fluctuate more *in general,* as they sit in the testing room thinking about what a creep he is. So when Schlitz was staring at them, the meters are recording samples which are going to be sampling from a different distribution of agitation valence. Maybe they picked a bad statistical lumping function on top of this.

      This was the first thing I thought of.

    • Anonymous says:

      It’s possible the 20m distance and however many walls wasn’t enough for sensory isolation, and one of the starers made detectable sound when moving to look at/away from the screen.

      Blinding the sender to the experimental condition would avoid both accidental and malicious back channels like this. One possible design would be to have the video be sometimes delayed by 30 seconds, which would let you separate the effect of “the receiver is being watched” and “the sender thinks they’re watching”.

      • Jonas says:

        One possible design would be to have the video be sometimes delayed by 30 seconds

        If the average time of travel from greeting the researcher to entering the view of the camera is small enough that a 30-second difference is noticeable, there is still a signal (“how long until I see the person on the screen?”) which the “sender” could pick up on. One idea for getting around this: have a stooge delay people 30 seconds in the hall if their video isn’t delayed, or have them sit in one of two rooms at unequal distances, with the video delay equal to the average travel-time differential. Or just instruct the sender to delay turning on his screen for X seconds, where X >= the slowest plausible travel time plus the video delay.
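
        A minimal sketch of what such a schedule could look like, assuming made-up timing constants rather than values from any actual protocol:

        import random

        VIDEO_DELAY = 30         # seconds of optional tape delay (assumed value)
        MAX_TRAVEL_TIME = 90     # slowest plausible walk from greeting to camera view (assumed value)

        def schedule_trial(rng=random):
            # Randomly assign a live or delayed feed, and keep the sender's screen
            # blank long enough that arrival time reveals nothing about the draw.
            delayed = rng.random() < 0.5
            return {
                "delayed_feed": delayed,
                "feed_latency_s": VIDEO_DELAY if delayed else 0,
                "sender_screen_on_after_s": MAX_TRAVEL_TIME + VIDEO_DELAY,
            }

        print(schedule_trial())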

  10. John says:

    “In psychotherapy, for example, practically the only consistent finding is that whatever kind of psychotherapy the person running the study likes. ” – I think you accidentally a word.

  11. Sarah says:

    I think this is an argument for including cruder, common-sense heuristics for thinking about scientific studies.

    *Effect size. If the effect is not *very* large and the results *very* unequivocal, you probably either have an illusory result or an artifact of misunderstood structure (A appears to *sometimes* cause B if A is really two things, A1 and A2, and only A2 causes B.)

    *Physical plausibility. You don’t believe in psi because it violates physics. You also shouldn’t believe that drugs that don’t cross the blood-brain barrier can have effects on the brain.

    *Analogy. If people have been finding MS “cures” for decades, the next MS cure isn’t so credible.

    *Multiple independent lines of reasoning. Evolutionary, biochemical, and physiological arguments pointing in the same direction. Especially simple arguments that are hard to screw up.

    *Motivation. Cui Bono. Yes, we care who funded it.

    I think what we’re finding is that *blind* science, automatable science, “studies say X and this quickly-checkable rubric confirms the studies are good”, isn’t a good filter.

    To be fair, rubrics could stand a lot of improvement. (Cochrane, btw, *does* pay a lot of attention to experimental design.) I do think an ideal meta-analysis could do a lot better than the average meta-analysis.

    But the purpose of methodology is to abstract away personal opinion. We don’t do this *just* to better approach truth. We also do it to avoid getting in fights. We want to be able to claim to be impersonal, to be following a documented set of rules. In an era where the volume of science is such that all metrics will be gamed and cargo-culted, methodology may just not be enough. Old-fashioned, pre-modern heuristics like “is this an honest man?” and “does this make sense?” are unreliable, to be sure, but they’re unreliable in a different way than statistics and procedures, and it may be time to consider their value.

    • Michael Edward Vassar says:

      The hardcore take on funding bias is to just consider any study unworthy of consideration if it was funded *at all*. That’s actually what I’d like to enable, and what everything I’m working on is an attempt to build towards.

      • Jonas says:

        consider any study unworthy of consideration if it was funded at all

        Reading it literally, that is only possible if the study doesn’t incur any expenses. Here are some potential expenses: researcher salary, subject compensation, equipment.

        I can see how you do science with no researcher salary, if the researchers themselves are independently wealthy* or have the necessary free time, and I can see uncompensated subjects participating for the fun or to promote knowledge. But building all your equipment (e.g. particle accelerators or microscopes) from scratch, no buying of any components, not harnessing specialization and trade? ‘You nuts?

        * That would make science a hobby of the aristocracy, just like in the Good Old Reactionary Days (I’m not a (neo-)reactionary).

        Something I would expect from a Normal Person—you know, the kind who mostly doesn’t comment on blogs like this one—would be to at least allow self-funding, i.e. allowing the aristocrats and hobbyists to buy their own equipment. Maybe what you meant was to taboo funding other than through The Right Channel, which is what? Government subsidies to basic science? But why is the government’s agenda less distorting than other agendas? Okay, suppose it isn’t, and it gives the individual researchers free rein; why are their agendas less distorting?

        If your goal is for agenda influences to all wash out and simply fund people who (in aggregate) do Pure Truth-Seeking Science, it’s not clear that self-funding or un-funding or government-funding or government-plus-private-funding is the right (or wrong) way to achieve that effect.

        Comments? Have I horribly distorted what you said?

    • Anonymous says:

      The problem with the physics argument is that there could be a yet-unknown mechanism causing the effect. Like drugs that don’t cross the blood-brain barrier but happen to be radioactive: if we didn’t know about radioactivity and couldn’t separate radioactive drugs from the others, the effect would look like a violation of physics and might show up in only some labs.

  12. Jacob Steinhardt says:

    In regards to the Wiseman & Schlitz paper, the sample size is quite small and p-value is only 0.04. Shouldn’t one major possible explanation be: this happened by chance?

    • Daniel says:

      (This is simplifying issues and ignoring fundamental problems with null hypothesis testing.)

      Let’s imagine two studies. Study A has a sample size of 100 and the p-value to reject H_0 is 0.04. Study B has a sample size of 1000 and the p-value to reject H_0 is 0.04.

      Question: What’s the difference in the probability to falsely reject the H_0 between the two studies?

      • Johann says:

        There is no difference. P(i ≤ α| H0), where “α” is any threshold (e.g. 0.05) and i is the p-value of a study, does not depend upon sample size, n. There’s nothing particularly surprising about this, though, IMO; people often think there is because they mix it up with P(i ≤ α| Ha), or power, which increases as n increases.

        The second study had only a larger sample size; given the same p-value, then, this logically implies it had a much smaller effect size. The converse is true: the first study had a much larger effect size, which made up for its small sample size. The result was the same p-value.

        IF there’s a real effect, the p-value of a study should asymptotically converge to zero; but if there’s no effect, the p-value stays uniformly distributed regardless of sample size, just as the measured ES keeps hovering around zero. This is pretty intuitive to me.

        But the commenter you replied to is correct with respect to *practical* considerations such as: large studies being of generally better quality, less likely to be non-representative of the true ES, etc, etc.
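
        A quick simulation makes the distinction concrete; this is only a sketch, with an arbitrary true effect of d = 0.3 standing in for “a real effect”:

        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(0)

        def median_p(effect, n, reps=2000):
            # Median two-sided one-sample t-test p-value over many simulated studies.
            ps = [stats.ttest_1samp(rng.normal(effect, 1.0, n), 0.0).pvalue for _ in range(reps)]
            return float(np.median(ps))

        for n in (20, 200, 2000):
            print(n, round(median_p(0.0, n), 2), round(median_p(0.3, n), 4))
        # The null column hovers near 0.5 at every n (p-values stay uniform),
        # while the d = 0.3 column collapses toward zero as n grows.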

        • Daniel says:

          Indeed, larger samples are better (see e.g. the recent discussion in Button, K. S., Ioannidis, J. P. A., Mokrysz, C., Nosek, B. A., Flint, J., Robinson, E. S. J., and Munafo, M. R. (2013). Power failure: why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience 14, 365ff.). Representativeness may not be achieved by just adding more data, though; see the discussion of failures of “big data” analysts to actually consider these statistical issues.

          Anyway, I wanted to reject the idea that the *combination* of “high” p-values and small sample size is the problem. It’s quite reasonable to use less strict significance thresholds for smaller sample sizes; on the other hand, a high p-value does not become somehow better or more reliable as the sample size increases. Indeed, as you imply, a p-value of 0.04 becomes quite meaningless for samples of more than, say, 10k respondents. Why? Because the p-value indicates only how likely the result would have been if the H_0 that there is no effect at all were true, a statistical fiction which is usually not a reasonable possibility at all.

        • Anonymous says:

          Actually, a p-value of .04 is meaningless for small sample sizes, but becomes increasingly meaningful as the sample size increases—but not in the direction one might think. For generally reasonable prior distributions, as N increases, a p-value of .04 indicates increasing evidence for the null hypothesis.

          For example, using a common default prior, and assuming a “medium” effect size under H1, a 2-tailed p-value of .04 implies Bayes factors (for H1 vs H0) of 1.7, .64, .21, and .07 for N=20, 200, 2000, and 20000, respectively. Only for the largest two sample sizes does the Bayes factor show much support for either hypothesis over the other, and then the favored hypothesis (since BF<1) is the null.

          You can verify the above trend by using the Bayes factor calculator here, although your specific numeric results will depend on the prior you choose.
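
          To see the trend without the calculator, here is a rough stand-in: not whatever default prior the calculator uses, just a conjugate normal prior with an assumed scale of 0.5 on the standardized effect, which reproduces the same Jeffreys-Lindley pattern even though the exact numbers differ:

          from scipy import stats

          z = stats.norm.isf(0.04 / 2)   # z-score corresponding to a two-sided p of .04
          tau = 0.5                      # assumed prior SD on the standardized effect

          for n in (20, 200, 2000, 20000):
              bf10 = stats.norm.pdf(z, 0, (1 + n * tau**2) ** 0.5) / stats.norm.pdf(z, 0, 1)
              print(n, round(bf10, 2))
          # BF10 shrinks as n grows and eventually drops below 1, i.e. the same
          # p-value increasingly favours the null at larger sample sizes.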

  13. All of these are extremely bad solutions, since they worsen the bureaucratization of science.

    Back in the eighteenth and early nineteenth century, science was high status. Smart, wealthy, important people would compete with each other to be more scientific. The result was that the scientific method itself was high status, and was, therefore, actually followed, rather than people going through bureaucratic rituals that supposedly correspond to following the scientific method. By and large, this successfully produced truth and exposed nonsense.

    • ozymandias says:

      I am unclear about why the high-status scientists wouldn’t preregister their trials, use heterogeneity analyses, look for high effect sizes, try to avoid experimenter effects, make their meta-analyses stronger, and the rest of it. It seems like all that advice could as easily be implemented by smart, wealthy, important people each competing to be more scientific than the other. Indeed, it seems like that’s exactly what they *would* be competing on. “What should we do?” and “how should we do it?” are importantly different questions.

      • Anonymous says:

        For countersignalling reasons! Preregistering your trials etc. signals that you think your scientificity is in doubt, which means you aren’t really a top class scientist. Remember when Einstein said he’d defy evidence that conflicted with his theory because he was so sure it was right? There was no way Einstein was going to be confused for a crank.

    • ckp says:

      We can’t turn back the clock on the bureaucratization of science now, because science is just so BIG nowadays. The amount of science is increasing exponentially (or even superexponentially) with time, and we’ve exhausted all of the “easier” low-hanging fruit results where status might have been enough to make sure you did it right.

      • Michael Edward Vassar says:

        That’s one hypothesis. I don’t consider it to be a credible hypothesis though.

    • Leonard says:

      The problem is not that science is not high status. Perhaps it used to be; I am not sure. But now, it most certainly is. Indeed, that is part of the problem. Science has such high status that we allow its crazier emanations to override common sense.

      The problem is the intersection of leaderless bureaucracy and “funding”. Bureaucratically “funded” science gradually loses its connection to reality, and sinks back into the intellectual morass from whence it came.

    • Anthony says:

      James, you’re wrong. Back in the 18th and early (to mid) 19th century, scientists were working mostly on non-living systems, where it’s a *lot* easier to be repeatable and to eliminate measurement biases. Until Darwin, almost any study of living, non-human systems was either stamp-collecting or wrong. There were plenty of people in those days who considered themselves “scientists”, studying psychology, sociology, and the like, but outside their respective schools, we mostly consider them cranks, unlike the pioneers of chemistry and physics.

    • Piano says:

      The problem is that Science is high status, but science is not. We’re trying to slowly make the two similar enough that science gets some high status by proxy and by virtue of Scientists accepting scientists into the fold. But, as long as Science is funded by democracies, politics will trump science.

      To an extent, that’s okay. Most people cannot afford for certain things to be destroyed by the truth, so democracy is an inadvertent and effective defence mechanism. Given the existence of people below IQ 125 (the “stupids”) who are members of different groups, we need to A) stick with democracy-controlled Science, B) obfuscate science the right amount so that it’s still allowed yet the smarter scientists can still keep their jobs, or C) mechanize the whole thing so that the only thing that can still be accused of heresy is systems of mathematics that are necessary for the rest of the economy to function.

      Until someone shows that mathematics, and not the mathematicians, has been accused of heresy, I’m going to be partial towards C.

  14. Doug S. says:

    Bem and other parapsychologists should be required to attempt to publish their papers in physics journals. (Let’s see them massage a p-value all the way down to 0.0000003!)
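
    For reference, that figure is roughly the one-tailed tail probability of a 5-sigma result, the conventional discovery threshold in particle physics; a quick check:

    from scipy import stats

    print(stats.norm.sf(5))       # ~2.87e-07, i.e. roughly 0.0000003 one-tailed
    print(2 * stats.norm.sf(5))   # ~5.7e-07 for the two-tailed version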

  15. Doug S. says:

    Incidentally, high school and undergraduate college students disprove basic physics and chemistry in their lab courses all the time…

    • Zach says:

      Sure, but those students are generally shown to be wrong when their experimental methods are critiqued, things that are supposed to be random are randomized, math errors are corrected, or others replicate their experiments. The issue here is that the p-values actually decrease and we become even more certain of the “wrong” results after replication and meta-analysis.

  16. My prior belief in psi was awfully low (though maybe not as low as 10^-20), but a major effect of reading this article and the linked studies has been to greatly increase my belief in its possible existence. This is particularly the case since all of the various arguments and hypotheses about experimenter effects causing these results to appear bear the stench of motivated cognition. And noticing the amount of motivated cognition required to explain away the result makes me place the estimated probability even higher.

    Thanks, Scott. /sarcasm

    • I should add that the other effect of reading this article was to make me much more skeptical about published science, especially in non-STEM sciences, and I think this was the intended result. But of course my default setting was to pretty much disbelieve all published psychology, sociology, and economics anyway.

      • Anon256 says:

        Non-Science/Technology/Engineering/Maths sciences?

        • nydwracu says:

          Medicine isn’t generally included in STEM. Maybe he meant medical science?

          edit: Or social science, given the context.

        • This is what I get for not thinking about acronyms. So let me be explicit: I have high confidence in physics, chemistry, mathematics, computer science, engineering, and biology. I have low confidence in psychology, sociology, anthropology, and many forms of medicine.

          (My actual degree is in linguistics, and I have a middling view of that field. Most theoretical linguistics is trash, but most linguistics is not theoretical linguistics.)

        • Creutzer says:

          What linguistics are you thinking about that is not theoretical linguistics, but also not stamp collecting? Or are you thinking of the stamp collecting?

        • @Creutzer, I’m not familiar with your use of the term “stamp collecting” here. Is this a disparaging term for basic research, i.e. going somewhere, learning an undocumented language, and writing up a grammar for it? If so, then I admit that a lot of the non-garbage linguistics is “stamp collecting”, but I strongly reject the implicit value judgement in that term.

          In any case, I suggest historical linguistics as a branch of linguistics which is neither stamp collecting nor unempirical gas-bagging. Phonetics likewise. Phonology has some very good points, but the theoretical spats over generative/OT models are pretty much useless. Syntax is a wasteland.

          I adopt the maxim “Chomsky is wrong about everything” as a good rule of thumb for both linguistics and politics.

        • Creutzer says:

          Well, stamp collection, whether in biology or linguistics, doesn’t develop explanations. That makes me kind of go “meh”.

          I agree about all your other points. I was just puzzled by the “majority” statement.

          There is one thing to be said for the Chomskyans, though: Ironically, they are the better stamp collectors. Just think of all the details about familiar languages like English, German, Dutch and Italian that have been discovered by Chomskyans. And those Chomskyans who do turn to the description of new languages often look at things systematically in a way others wouldn’t, and they less often use the word “topic” in a way that makes me want to strangle them.

    • name says:

      I once had a college professor who believed in all sorts of parapsychology. One day, in the middle of a lecture, she announced that the class was going to do a… I forget the term she used and it certainly wasn’t “mind-reading exercise”, but a mind-reading exercise is what it was.

      This went as follows: Everyone paired off. One person would picture a location important to them for a few minutes, focusing solely on that, and then the other person would close their eyes, notice whatever thoughts came into their head, and try to pick up what the first person was thinking of. Then the two people would switch roles. Class discussion began after the thing was complete.

      Varying degrees of success were reported.

      I noticed myself interpreting the somewhat ambiguous statements of the other person in my pair as being in the right general area. So it could be that success was only due to suggestibility after the fact.

      But I gave no ambiguous statements; I said the other guy was thinking of a beach in Hawaii. He said that’s exactly what he had been thinking of. But I found a year later that he didn’t like the professor very much: he was a Catholic and she did not like Christianity at all, and so on, and she’d given him a low grade on one class for disagreeing with her. I can’t remember the chronology here, but the class he got a low grade in was the one least unrelated to that particular subject, and I know it was taught in that same classroom. He didn’t say anything in the class discussion either. Nevertheless, it could be that he was bullshitting for fear of getting a bad grade, and thought she’d overhear him.

      It could be that similar explanations apply to the whole class. It could also be that the results this professor had reported herself seeing were the result of poor planning, or bullshitting for effect, or one of a thousand other possibilities. It could also be that every other report of psi, or of any other strange effect that doesn’t quite fit into current theories of physics, had a similar explanation: myth, prescientific explanations pushed forward through cultural inertia, charlatanry, bad experimental design…

      It could be.

      • nydwracu says:

        (Dammit. Gravatar. The response that was obviously supposed to follow was that sometimes things that can be easily explained by postulating an otherwise surprising entity really are the results of pure coincidence. Now I have to figure out another method of eliciting the critique of neoreaction I was trying to lead to.)

      • Anonymous says:

        Here’s Derren Brown doing that one: https://www.youtube.com/watch?v=k_oUDev1rME

        • nydwracu says:

          Can’t watch videos on this thing, but it’s interesting that you’d link a pickup artist. The thing in the anecdote actually happened, but what I left out was the professor’s habit of pacing around the classroom. I’m almost certain that the force at work was just the students knowing to avoid contradicting the professor. A clever status-building exercise, but I think she actually believed all of it.

  17. James Babcock says:

    I once met someone who believed she had psychic powers. She described having done a personal experiment, with a mutual friend as experimenter/witness, and gotten a surprisingly large effect size. The experiment involved predicting whether the top of a deck of Dominion cards was gold or copper.

    As it happened, I had played Dominion at this friend’s apartment before, and so I had an unusually good answer to this experiment: I had seen that particular deck of cards before and it was marked. Not deliberately of course, but the rules of Dominion lead to some cards getting used and shuffled much more than others, so if cards start getting worn, they get easy to distinguish.

    That sort of observation would never, ever show up in a study writeup.

  18. Vanzetti says:

    Hmmm…

    Science gave us a giant pile of utility.

    Parapsychology gave us nothing.

    I feel like this argument is good enough for me to ignore even a million papers in Nature.

    • Randy M says:

      Can you draw a dividing line between what science-like things are and what parapsychology-like things are that cleaves the useful from the useless without begging the question?

      • Vanzetti says:

        No. But that’s parapsychology we are talking about. It’s pretty far away from the line I can’t draw, safely on the side of the crazy bullshit.

        Now, psychology on the other hand… 🙂

        • ozymandias says:

          Can you explain your system of ranking things from “more sciencey” to “less sciencey”? Uncharitably I would assume the answer is how high-status it is among skeptics.

  19. Robin Hanson says:

    As a theorist at heart, I’m tempted to adopt an attitude of just not believing in effects where the empirically estimated effect size is weak, no matter what the p-values. Yeah I won’t believe that aspirin reduces heart problems, but that seems a mild price to pay. I could of course believe in a theory that predicts large observed effects in some cases, and weaker harder to see effects in other cases. But then I’d be believing in the law, not believing in the weak effect mainly because of empirical data showing the weak effect.

    • Desertopa says:

      It may *seem* like a mild price to pay, but in practice it leads to, what, more than a thousand avoidable deaths per year? In medicine, failure to acknowledge small effect sizes, when applied over large populations, can result in some pretty major utility losses.

      • Scott Alexander says:

        This was the argument I was going to make too (although I bet it’s way more than a thousand).

        Also, I’m pretty sure you’d have to disbelieve in the Higgs boson and a lot of other physics.

        • Sniffnoy says:

          Also, I’m pretty sure you’d have to disbelieve in the Higgs boson and a lot of other physics.

          No, he already addressed this; he’s talking about purely empirically detected effects with small effect size, not effects with small effect size backed up by theory.

        • Scott Alexander says:

          So would or would not Robin believe in the Higgs boson more once it was detected than he did before when it was merely predicted?

          If the latter, does he think it was a colossal waste of money and time to (successfully) try to detect it?

    • “Yeah I won’t believe that aspirin reduces heart problems, but that seems a mild price to pay.”

      Actually, there are theories about the way aspirin works:

      “Antiplatelet agents, including aspirin, clopidogrel, dipyridamole and ticlopidine, work by inhibiting the production of thromboxane.”

      http://www.strokeassociation.org/STROKEORG/LifeAfterStroke/HealthyLivingAfterStroke/ManagingMedicines/Anti-Clotting-Agents-Explained_UCM_310452_Article.jsp

      The “inhibition” of headaches by aspirin is based pretty much on the very same mechanism, with some minor variations in the biochemical pathways, and differences in the body’s target-areas. But because its effect size is much much bigger than its effect as an antiplatelet, theorists at heart rarely dismissed it (especially when in need…). Aspirin was used to fight headaches for about a century before its anti-headache mechanism was discovered.

      Julio Siqueira
      http://www.criticandokardec.com.br/criticizingskepticism.htm

  20. http://news.sciencemag.org/brain-behavior/2014/04/male-scent-may-compromise-biomedical-research says that lab mice display pain less in the presence of the scent of a human male or his smelly t-shirt. The mice showed 2/3 the pain when near the human male scent (in person or through shirt).

  21. Eric Rall says:

    As I understand it, the actual story of Galileo vs the Scholasticists involved a key role played by the emergence of gunpowder artillery as a major battlefield weapon. Specifically, artillery officers who aimed their weapons using Aristotelian mechanics (cannonball follows a straight line from the muzzle until it runs out of impetus, then falls straight down at a rate dependent on its mass) missed their targets, while those who used the theories developed by Galileo et al (curved path whose curve depends only on the angle and muzzle velocity of the cannonball, not its mass) hit their targets. And because effective use of artillery was becoming a life-and-death issue for various high-status people, those people paid serious attention to what made their artillery officers better at hitting their targets.
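
    To make the contrast concrete, here is a toy version of the Galilean calculation (idealized flat ground, no air resistance, so purely illustrative):

    import math

    def cannonball_range(muzzle_velocity_ms, angle_deg, g=9.81):
        # Idealized range: depends only on launch angle and muzzle velocity, never on mass.
        theta = math.radians(angle_deg)
        return muzzle_velocity_ms ** 2 * math.sin(2 * theta) / g

    for angle in (15, 30, 45, 60):
        print(angle, round(cannonball_range(300.0, angle)))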

    In that light, I’d suggest considering putting applied engineering at the top of the pyramid of science. The ultimate confirmation of a theory as substantially correct (*) has to be the ability to use the assumption of that theory’s correctness in actually making and doing things and to have those things actually work in ways that they wouldn’t if the theory were fundamentally flawed. Of course, as I write this, I’m realizing that while this works great for things like using physics theories to build airplanes and rockets, the “applied science” standard can have really shitty results in fields where the appearance of success is easier to come by accidentally. I’ll leave telling the difference between the two cases as a massive unsolved problem.

    (*) “Substantially correct” has to be qualified because of issues like Newtonian Mechanics being demonstrably wrong in very subtle ways, but it still having practical usefulness because it’s correct enough to correctly predict all sorts of well-understood cases.

    • suntzuanime says:

      If your psychological theory is so accurate, why aren’t you a cult leader?

    • Anthony says:

      I commented below that science with bad epistemology is called “engineering”, but it’s worse than that. Engineering results don’t need a theory, and/or can live quite happily with two mutually-incompatible theories about what’s happening.

      I’m a soil engineer, and some of what we do is downright embarrassing when you look into its theoretical basis. And on how much we extrapolate based on seriously limited data. (Extrapolate, not just interpolate.)

      • It sounds like engineering that’s based on incoherent theory would be fertile ground for finding hypotheses to test to develop better theories. Are people working on that?

        • Anthony says:

          In my field, somewhat. What we’ve got is theory with mathematically intractable application, and/or real-life circumstances with inadequate characterization of the properties so that the theory is impossible to apply. Imagine finding the actual amount of friction in every pair of surfaces in a car.

          Software engineering seems to get along quite well with very little contact with the formal theory of software, and from the outside, it seems that their terrible results stem from overmuch complexity rather than poor theoretical foundations.

  22. somnicule says:

    It seems that naive experiments on sufficiently complicated systems may as well be correlational studies. Along the lines of Hanson’s comment, without a powerful causal model behind the results of an experiment, it’s very hard to draw any meaningful or useful conclusions, particularly in cases of small effect size. Things like throwing medications at the wall and seeing what sticks, or just seeing what happens in specific circumstances for psychology experiments, all seem a bit cargo-culty. A stop-gap measure until we get solid causal models.

  23. The Nabataean says:

    Suppose this isn’t just lousy protocol, and psi really does exist. All of these purported psychic phenomena are events that seem just a little too farfetched to be coincidence, and happen on the scale of human observers.
    If this is the case (and just so you don’t get the wrong idea, I think that’s a very big ‘if’), that could be evidence in favor of the Simulation Hypothesis. Maybe the beings running the simulation occasionally bias their random number generators for the benefit (or confusion) of the simulated humans. Maybe they want to see how much we can deduce about our world when faced with seemingly inviolable laws of physics that nevertheless seem to be violated. Why? Perhaps they want to find out if they could have missed something in their own model of physics. They think they understand their world fully by now, but then again, there are always some anomalies and fishy results. They might want to run a simulation to see whether, if there really were some monkeywrenches being thrown into otherwise tidy patterns of cause and effect, intelligent beings would be able to infer their existence.

    Is parapsychology the control group, or are we the experimental group?

    • “Psi exists” strikes me as more likely than “We are in a simulation”, and is favored by my internal implementation of Occam’s Razor.

    • Slippery Jim says:

      Simulation Hypothesis and psi?
      Enter Johnstone’s Paradox

      1. The universe is finite.

      2. All phenomena in the universe are subject to, and can be explained in terms of, a finite set of knowable laws which operate entirely within the universe.

      1) If reality is ultimately materialistic and rational, then it could be described in a finite set of instructions and modelled as information.

      2) If it could be modelled in this way, then it will be — at the very least because, given limitless time, all possible permutations of a finite universe should occur.

      3) For every one original reality there will be many such sub-models, and they too will generate many sub-sub-models.

      4) The nature of complex systems means that it is almost impossible for any reality to reproduce itself exactly; indeed, there is a greater likelihood that the sub-models will be mutations of the original, subject to different structures and laws.

      5) Because the models vastly outnumber the original reality, or realities, it is therefore more likely that we are living in a universe modelled as information, and it is most likely that it is not identical to the original reality.

      6) Thus Johnstone’s Paradox: if reality is ultimately materialistic and rational, then it is highly unlikely we are living in a materialistic, rational universe.

      [This was advanced in 1987 by Lionel Snell and is roughly isomorphic to Bostrom’s argument.]

      • anon says:

        The conclusion seems flawed. Maybe we’re living in a universe that’s an inaccurate simulation of a different universe. But that has no bearing on whether or not the universe is materialistic or rational. We might not match the original reality, but that doesn’t imply causality doesn’t exist.

  24. Pingback: The Ouroboros of Science | CURATIO Magazine

  25. Alexander Stanislaw says:

    Everyone seems to be assuming that bad epistemology makes for bad science. But does it? One advantage to bad epistemology (ie. normal science in which scientists have an incentive to prove their hypotheses rather than objectively test them) is that correct results are recognized more quickly. You get incorrect results too, but that is the price you pay. I anticipate that a super rigorous approach to science would slow down progress in most fields.

    • Douglas Knight says:

      If you bias towards false positives, then maybe true positives get published quicker, but there’s a difference between “published” and “recognized.” In psychology, everyone is in their own bubble, completely ignoring everyone else’s work. Good work is never recognized.

    • Anthony says:

      Science with bad epistemology is called “engineering”.

      • Alexander Stanislaw says:

        And I’d say that engineering is a massive success from an instrumental rationality perspective.

  26. Chris says:

    Just wanted to leave a note saying that this is outstanding — the best blog post I’ve read this year, I think — and I expect I’ll be referring people to it for a long time. Thanks for writing it! Have you thought about setting up a tip jar of some sort? (Paypal, Gittip, Patreon, etc)

  27. Pingback: A modest proposal to fix science « Josh Haas's Web Log

  28. MugaSofer says:

    I feel suddenly less critical of Mage: The Ascension, a game where reality was entirely subjective/shaped by expectations and the “laws of physics”, aka scientific consensus, were standardized by an ancient conspiracy in order to give humanity a stable footing for civilization.

  29. Johann says:

    Let’s see if we can agree on one thing here, Alexander; I think you’ve written a very intellectually engaging piece, with a great deal of thought behind it—certainly one of the more interesting I have read—but I still have some basic concerns I would like to flesh out. I’ll start off with the caveat that I’m favorably disposed towards psi and parapsychology, and that I’m fairly well invested in researching the field, but I hope you’ll agree with me that we can have a productive exchange despite this most unsupportable conviction :-). If I am correct, all participants including myself will leave with an enhanced understanding, and perhaps respect, for the positions of both sides of this debate.

    I’ll start by noting my most significant argument in relation to your piece: that if all these experiments, as you graciously concede, are conducted to a standard of science that is generally considered rigorous, are we not well-justified in concluding at least this: “The possibility that psi phenomena exist must now be seriously considered”? If not; what, I ask, can we say in defense of scientific practice? For if we allow it of ourselves to conduct numerous experiments of high-quality, designed by definition to eliminate (or at least strongly mitigate) explanations alternate to the one we have decided to test for, and then do not even bequeath to our conclusions—upon finding a positive result—the concession that the original explanation is a viable one, how do we justify our first impetus to scientifically investigate that explanation in the first place?

    To illustrate my difference to your position, consider the following quote from your essay:

    “After all, a good Bayesian should be able to say “Well, I got some impressive results, but my prior for psi is very low, so this raises my belief in psi slightly, but raises my belief that the experiments were confounded a lot.”

    I’m led to question whether you really did not mean to say something slightly different. After all, if we take these words at face value, can we not—satirically—call them a decent formula for confirmation bias? A prior belief is examined with a strenuous test; that test lends evidence against the belief; we therefore conclude the test is more likely to have been flawed (i.e. evidence against our position causes us to reaffirm our belief). How do you counter this? IMO, statistical inference, whether bayesian or frequentist, only allows us to rule out the hypothesis of chance—it says nothing about the methodology behind an experiment. Thus, people only ever accept the p-value or Bayes factor of a study literally if they already believe the experiments were well-done.

    Now let me address some of your specific points, to see if I cannot make the psi hypothesis a slightly more plausible one to you:

    You mention Wiseman & Schlitz (1997), an oft-cited study in parapsychology circles, as strong evidence that the experimenter effect is operating here. I certainly agree. At the end of their collaboration, both had conducted four separate experiments, where three of Schlitz’s were positive and significant, and zero of Wiseman’s were. Their results can only be explained in two ways: (1) psi does not exist, and the positive results are due to experimenter bias, and (2) psi does exist, and the negative-positive schism is still related to experimenter effects. Let’s ignore issues of power, fraud, and data selectivity for now (if you find them convincing, we can discuss them in another post).

    The rub, for me, is that this is an example of a paper that is designed to offer evidence against hypothesis (1)—Wiseman certainly wasn’t happy about it. The reason is that both experimenters ran protocols that were precisely the same but for their prior level of interaction with subjects (and their role as starers), ostensibly eliminating methodology as a confounding problem. Smell or other sensory cues, for example (as was mentioned in the above comments), could not have been the issue; staring periods were conducted over closed-circuit television channels, and the randomization of the stare/no-stare periods was undertaken by a pseudo-random algorithm, where no feedback was given during the session that would allow subjects to detect, consciously or subconsciously, any of the impossibly subtle micro-regularities that might have arisen in the protocol.

    Now, you—understandably, from your position—criticize hypothesis (2), but consider the following remarks from Schlitz and Wiseman, after their experiment had been completed:

    “In the October 2002 issue of The Paranormal Review, Caroline Watt asked each of them [Wiseman and Schlitz] what kind of preparations they make before starting an experiment. Their answers were: Schlitz: […] “I tell people that there is background research that’s been done already that suggests this works […] I give them a very positive expectation of outcome.” Wiseman: “In terms of preparing myself for the session, absolutely nothing”

    The social affect of both experimenters seems to have been qualitatively different; we can say this almost with complete certainty (and it’s not unexpected). If we acknowledge, then, such confounding factors as “pygmalion effects” (Rosenthal, 1969), it would be only rational to conclude that—should psi exist—attempts to exhibit it would be influenced by them. Even more clearly, IMO (and why parapsychologists tend to see this experiment as consistent with their ideas), it was Wiseman who did the staring in the null experiments, and Schlitz who did the staring in the positive ones. Would it not make sense that a believer in psi would be more “psychic” than a skeptic, if psi exists? (or that a person with confidence in their abilities could make a better theatrical performance, or more likely deduce the solution to a complex mathematical problem, if they are not insecure about their skill level?)

    Parapsychologists are only following the data, to the best of their ability. You’ll find that, under the psi hypothesis, the discrepancy in success is relatively simple to explain, whereas under the skeptical explanation we must conclude that even the most minuscule variation in experimental conditions—so minuscule that it must be postulated apart from the description of the protocol and will likely never be directly identified—can cause a study to be either significant or a failure. We must, in other words, logically determine that our science is still utterly incapable of dealing with simple experimenter bias; not just on the level of producing spurious conclusions more often than not (as Ioannidis et al show), but to the degree of failing to reliably assess literally any moderately small effect. This is itself a powerful claim.

    But I will return to the nature of the psi hypothesis later. For now, I will cover parapsychological experimenter effects more broadly. Consider the following: as we probably both agree, Robert Rosenthal is one of those scientists who has done a great deal of work to bring expectancy influences to our attention; his landmark 1966 book (enlarged edition 1976), “Experimenter Effects in Behavioral Research”, for example, has not inconsiderably advanced our understanding of self-fulfilling prophecies in science. Would it then surprise you to learn that Rosenthal has spoken favorably on the resistance of a category of psi studies (called ganzfeld experiments) to just the sort of idea expounded by hypothesis (2)? See the following quote from Carter (2010):

    “Ganzfeld research would do very well in head-to-head comparisons with mainstream research. The experimenter-derived artifacts described in my 1966 (enlarged edition 1976) book Experimenter Effects in Behavioral Research were better dealt with by ganzfeld researchers than by many researchers in more traditional domains.”

    What if I told you that Rosenthal & Harris (1998) co-wrote a paper evaluating psi, in response to a government request, with overall favorable conclusions towards its existence; would you be inclined to read a little more of the literature on parapsychology? (The reference here is “Enhancing Human Performance: Background Papers, Issues of Theory and Methodology”)

    Whatever you believe about psi, I agree with you that examination of parapsychological results can do much to bolster our understanding of setbacks in experimentation; however, I also believe that thinking about and examining our many attempts (and there are quite literally thousands of experiments, and dozens of categories, with their own literature) to grapple with potentially psychic effects has the merit of helping to engender a truly reflective spirit of inquiry, for the reason that they represent precisely that ideal of science we dream of meeting—using data and argument to resolve deeply controversial, and potentially game-changing, issues.

    On a superficial level, we already have evidence that parapsychology employs much more rigorous safeguards against experimenter effects than most any other scientific discipline. Watt & Naategal (2004), for example, conducted a survey and found that parapsychology had run 79.1% of its research using a double-blind methodology, compared to 0.5% in the physical sciences, 2.4% in the biological sciences, 36.8% in the medical sciences, and 14.5% in the psychological sciences. These findings are consistent with those of an earlier survey on experimenter effects by Sheldrake (1998), which found an even greater disparity favorable to parapsychology.

    Since the field grew up amid vigorous debates between proponents and skeptics, however, I find it intuitively plausible that this should be the case (the same amount of vehement scrutiny used to contest telepathy is not used to criticize studies of the effect of alcohol on memory, for example), so these findings—while a bit surprising to me—don’t seem, on reflection, to be very much out of place.

    I think, however—and you will probably agree with me—that I could ramble on about safeguards and variables all day, without any effect on your opinion, if I do not discuss the most crucial, foundational issues pertaining to psi. It would be like trying to convince you that studies of astrology have rigorously eliminated alternate explanations; after all, if the hypothesis we would have to entertain is that the stars themselves, billions of miles away, determine our likelihood to get laid on any given day, it doesn’t matter how strong the data is—we will always suspect a flaw.

    I therefore suggest we take a wide-angle view, for a moment, on the psi question. We cannot hope to be properly disposed towards its investigation if we do not take that view—certainly it would be unacceptable to simply absorb the popular bias against it, without critical thought, since that’s exactly the religious mindset we eschew; neither would it be acceptable to embrace its possibility because we want it to be true, or because it’s widespread in the media.

    Let me first address the physical argument. I’m well-versed in the literature of physics and psi myself, but my friend, Max, is a theoretical physics graduate studying condensed matter physics, with a long-standing interest in parapsychology. He and I both agree that you are overestimating the degree to which psi and physics clash. Before I state why, consider that our opinion is not so unusual, for those who have thought about the question at length; Henry Margenau, David Bohm, Brian Josephson, Evan Harris Walker, Henry Stapp, Olivier Costa de Beauregard, York Dobyns, and others are examples of physicists who either believe that psi is already compatible with modern physics, or else think (more plausibly, IMO) that the current physical framework is suggestive of psi. De Beauregard actually thinks psi is demanded by modern physics, and has written so.

    In light of these positions, you will see that our perspective is not an unreasonable one to maintain. Basically, we agree that if we take physical theory in its most conventional form (hoping thereby to reflect the “current consensus”), psi and physics are just barely incompatible. I say “just barely” because we have such suggestive phenomena as Bell’s EPR correlations, which Einstein himself derided as telepathy, (but which we now have incontrovertibly proved through experiment) that show how two particles may remain instantaneously connected at indefinite distances from each other, if once they interacted. It is true that this phenomenon of entanglement is exceptionally fragile; however, experimental evidence in physics and biophysics journals these days purports to show its presence in everything from the photosynthesis of algae to the magnetic compass of migrating birds, and more such claims accrue all the time. Entanglement is entering warm biological systems.

    The incompatibility arises if we conceive of psi as an information signal; if we think something is “transmitted”; because the no-signaling theorem in quantum mechanics says EPR phenomena are just spooky correlations, not classical communication. You cannot use an entangled particle, as Alice, to get a message to Bob, for example, in physics parlance. However some parapsychologists and physicists don’t think of psi as a transfer, and lend to it the same spooky status as EPR—unexplained non-local influence. If this is correct, and you are willing to accept that non-local principles can scale up to large biological organisms (as the trend of the evidence is indicating), but to a larger degree than has ever been experimentally verified (outside parapsychology, of course), then certain forms of psi are already compatible with physics (e.g. telepathy).

    It may also surprise you to know that the AAAS convened a panel discussion on the compatibility of physics and psi, with numerous physicists in attendance, where the general consensus was that physics cannot yet rule out even phenomena like precognition. The main reason given was that the equations of physics are time-symmetric; they work forwards and backwards equally well. There are, in fact, interpretations of quantum mechanics like TSQM that play explicitly on this principle, with optics experiments unrelated to parapsychology designed to provide evidence for retro-causality (e.g. Tollaksen, Aharonov, Zeilinger). Some of them exhibit results that are rather intuitively and elegantly explained under a retro-causal model, and have more convoluted mathematical interpretations in other frameworks (all QM experiments at this time can be explained by all the interpretations, to various degrees).

    I would talk more about other ways that psi and physics can be reconciled, such as by introducing slight non-linearities in QM, but I sense that this may bury my point rather than clarify it. Where psi and physics are concerned, therefore, I say just this: that if physicists can unblinkingly confront the possibility of inflating space, multiple universes, retro-causality, observer-dependent time, universal constants, black holes, worm holes, extra dimensions, and vibrating cosmic strings—much for which there exists fleetingly little experimental evidence, a good deal of theoretical modeling, and a lot of funding—we cannot, with a straight face, dismiss a considerable body of experimental evidence for something as mundane as telepathy—or of slightly more significance: precognition (the future doesn’t have to be “fixed” to explain these experiments, BTW).

    Now, if you’re looking for evidence that Bem’s experiments themselves, and their replications, were well-conducted, and well-guarded against spurious expectancy effects, thus providing parapsychological evidence for retro-causality, I can only say that I personally think they were, having read the relevant papers and thought about their methodologies. However, my area of expertise relates more to ganzfeld experiments (telepathy), which in my opinion convincingly show that critics have been unable to account for the results. I have been led to this conclusion by personal examination of the data, as well as debate, and in this capacity I have seen every methodological and statistical criticism in the book, as well as every parapsychological rejoinder to them. IMO, no one has yet been able to identify a single flaw in either the ganzfeld experiments or the treatment of those experiments that can successfully account for their results—and none of the major skeptics purport to. I am happy to debate anyone on this issue. A paper authored by myself and Max, in fact, will be coming out in the Journal of Parapsychology in June, on that very subject, if you care to read it; it tackles a number of general criticisms of psi research, with a focus on the ganzfeld, using empirical and theoretical approaches. Look for “Beyond the Coin Toss: Examining Wiseman’s Criticisms of Parapsychology”. At the parapsychology convention in San Francisco this year, as well, we may likely do a presentation on it.

    The rub on the ganzfeld, by the way, is this: where the baseline expected hit rate is 25%, the observed hit rate is 32%, across thousands of trials and more than a hundred experiments; and if we partition trials into those where subjects have been specially selected in accord with characteristics predicted to significantly enhance performance, we instead find hit rates of upwards of 40% (with 27% as the average proportion of hits across unselected subjects). A main focus of our paper is the proposal that we can better the ganzfeld experiments by focusing on these special subjects.
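
    For a rough sense of what that departure from chance looks like, a back-of-the-envelope binomial check (the trial count below is an arbitrary placeholder, since only “thousands” is specified):

    from scipy import stats

    n_trials = 3000                   # placeholder; only "thousands" of trials is specified
    hits = int(0.32 * n_trials)       # observed 32% hit rate
    p_value = stats.binom.sf(hits - 1, n_trials, 0.25)   # P(X >= hits | 25% chance baseline)
    print(hits, p_value)              # many orders of magnitude below any conventional threshold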

    As I wrap up, I will admit to finding this extremely fascinating. Confronted with the kind of findings I discussed, I find myself saying—as we are often impelled to do, in science, by strange data—that it didn’t have to be this way. And it didn’t. For every example of a spuriously successful scientific hypothesis, I would wager that there are a dozen that simply didn’t make it. We could have obtained a packet of null studies in parapsychology, but instead we wound up with a collection of successively improved, robust, and (many) methodologically formidable experiments (many are also poorly conducted) that collectively—in almost every paradigm examined to date—exhibit astronomically significant departures from chance. Intelligent people have performed and assailed these experiments, but no satisfactory explanation exists today for them.

    Is it possible that this seeming signal in the noise is psi? I think so. Physics doesn’t rule it out; some aspects are even suggestive. People, also, have anecdotally claimed its existence for millennia, so psi is not without an observational precedent. Nor is it without an experimental precedent, as many experiments conducted from the turn of the 19th century to today have sought to evidence it, and found results consistent with the hypothesis.

    Should we be surprised at psi? Of course—the phenomenon defies our basic intuitions; I’m not claiming we shouldn’t be skeptical. But we should also be open-minded, and not hostile. Psi touches our scientific imaginations; it has accompanied our scientific journey from its inception (one of the first proposals for the scientific method, made by Francis Bacon, was to study psi). It is directly in connection with investigating it, in fact, that several of our favorite procedures came into being, such as blinding and meta-analysis.

    I conclude this commentary with the following point: if, after having obtained results like those I just described and alluded to, as well as those you eloquently summarized for us, all that we can bring ourselves to say is “there must have been an error someplace”, I respectfully contend that what has failed is not our science, but our imagination. It is not time for scientists to throw out their method; it is not time for us to conclude that an evidence-based worldview cannot survive in the face of human bias; rather, it is time for scientists to become genuinely excited about the possibility of psi. We need more mainstream researchers, more funding, and more support to decide the question. Surely you will concede it is an interesting one. In pursuit of its answer, much is to be gained in understanding either Nature or our own practices in connection with investigating her mysteries.

    * If any of what I have said interests you, I highly recommend reading the following exchange between my colleague, Max, and Massimo Pigliucci (a skeptic), on the topic of parapsychology—especially the comments. The debate is illuminating.

    http://rationallyspeaking.blogspot.com/2011/12/alternative-take-on-esp.html
    http://rationallyspeaking.blogspot.com/2012/01/on-parapsychology.html

    Thank you for an interesting read.

    Best, – Johann

    • anon says:

      TL;DR Version

      – When Scott used Bayes’ Theorem to update towards the result that the study was flawed, that was more or less confirmation bias.

      – Statistics can only tell us about probability not other things (this argument made no sense to me, I might be misunderstanding it).

      – Skeptics block psi effects which is why only believers discover their effects.

      – People are more skeptical of psi than other phenomena, unfairly.

      – Lots of smart people believe in psi and have published studies on it.

      – Physics doesn’t rule out psi. Quantum entanglement provides a mechanism psi might be operating through.

      – Commenter is publishing a study on psi soon with a physicist friend.

      • Johann says:

        Here’s to hoping the above is, more-or-less, facetious?

        An aversion to long comments is understandable, but since it is more liable to occur in connection with positions we disapprove of, the danger of missing interesting challenges to our ideas is there.

        • Randy M says:

          well, it was a very long comment, and a not unfair summary. This could use more elaboration:

          “If this is correct, and you are willing to accept that non-local principles can scale up to large biological organisms (as the trend of the evidence is indicating)”

          What evidence is that referring to?

        • Kibber says:

          With regards to the above comment, and in my personal experience, it’s more about credibility and writing style than disagreement. Scott can write long posts that I would read because I know from experience that his posts are good, plus his writing style is usually engaging all the way through to the end. In the above comment, you lost me around “certainly one of the more interesting” – i.e. before actually revealing any positions. Not necessarily a huge loss, of course, but disagreement had nothing to do with it.

        • Johann says:

          @ Kibber

          No matter how I write, some will perceive me as either genuine, evasive, stupid, or intelligent. And the smaller their sample is of my writing, the less accurate will be their judgements.

          My aim was simply to put forth a counter-perspective to Scott’s in the most genial, open manner possible; and to mitigate misunderstanding it had to be thorough. It is of course my prejudice that I have made arguments which should be considered on a level similar to his; this includes pointing out two technical mistakes on Scott’s part that I hope will eventually be fixed, and a number of observations in parapsychology that seem to me to warrant further study.

          My perspective is that of someone who has directly analyzed portions of parapsychology data—specifically ganzfeld data—and has two modest co-written papers accepted for publication on the subject, claiming that the findings are interesting in specific ways that warrant replication. It is your choice whether to include this perspective in your assessment.

          I am also more than happy to answer any questions you may have about my arguments. A good, well-intentioned debate is hard to pass up.

          Best, – J

      • he who hates deceitful error messages, like that i'm posting "too quickly" says:

        thanks for the precis

    • Scott Alexander says:

      I think on a Bayesian framework, the probability that psi exists after an experiment like this one would depend on your prior for psi existing and your prior for an experiment being flawed.

      This experiment produced results that could only possibly happen if either psi existed or the experiment was flawed, so we should increase both probabilities. However, how *much* we increase them depends on the ratio of our priors.

      Suppose that before hearing this, we thought there was a 1/10 chance of any given meta-analysis being flawed (even one as rigorous as this one), and a 1/1000 chance of psi existing.

      Now we get a meta-analysis saying psi exists. For the sake of simplicity let’s ignore its p-value for now and just say it 100% proves its point.

      In 1000 worlds where someone does a meta-analysis on psi, 100 will have the meta-analysis be flawed and 1 will have psi exist.

      The results of this study show we’re in either the 100 or the 1. So our probabilities should now be:

      1/101 = ~0.99% chance psi exists
      100/101 = ~99.0% chance the meta-analysis is flawed

      I think it’s a little more complicated than this, because we know there are other parapsychological experiments whose success or failure is correlated with this one. It’s probably not true that if a similar meta-analysis came out, I’d have to update to 90/10. And the fact that there have been a lot of studies looking for psi that found none also has to figure in somewhere.

      But this is the basic dynamic at work making me think of this as “mostly casts doubt on the analysis” rather than “mostly proves psi”.
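
      In code, the toy version of this calculation looks something like the following (same illustrative priors as above, and pretending a positive result is certain under either hypothesis):

          # Toy "two priors" update: given a positive meta-analysis, we must be
          # in a world where either the analysis is flawed or psi is real.
          prior_flawed = 1 / 10    # P(a meta-analysis this rigorous is still flawed)
          prior_psi = 1 / 1000     # P(psi exists)

          total = prior_flawed + prior_psi
          posterior_psi = prior_psi / total        # ~0.0099
          posterior_flawed = prior_flawed / total  # ~0.9901

          print(f"P(psi | positive result)    = {posterior_psi:.4f}")
          print(f"P(flawed | positive result) = {posterior_flawed:.4f}")
          # Note that the ratio posterior_flawed/posterior_psi is still 100:1,
          # exactly the ratio of the priors.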

      Regarding my low prior, you make a physics argument, and I’m not really qualified to judge. But I don’t find psi physically too implausible, in the sense that I wouldn’t be surprised if, a hundred years from now, scientists can create a machine that does a lot of the things psi is supposed to be able to do (manipulate objects remotely, receive messages from the future, etc).

      My worries are mostly biological. Psi would require that we somehow evolved the ability to make use of very exotic physics, integrated it with the rest of our brain, and did so in a way that leaves no physical trace in the sense of a visible dedicated organ for doing so. The amount of clearly visible wiring and genes and brain areas necessary to produce and process visual input is obvious to everyone – from anatomists dissecting the head to molecular biologists looking at gene function to just anybody who looks at a person and sees they have eyes. And it’s not just the eyes, it’s the occipital cortex and all of the machinery involved in processing visual input into a mechanism the rest of the brain can understand. That’s taken a few hundred million years to evolve and it’s easy to trace every single step in the evolutionary chain. If there were psi, I would expect it to have similarly obvious correlates.

      There’s also the question of how we could spend so much evolutionary effort exploiting weird physics to evolve a faculty that doesn’t even really work. I don’t think any parapsychologist has found that psi increases our ability to guess things over chance more than five to ten percent. And even that’s only in very very unnatural conditions like the ganzfeld. The average person doesn’t seem to derive any advantage from psi in their ordinary lives. The only possible counterexample might be if some fighting reflexes were able to respond to enemy moves tiny fractions of a second before they were made – but it would be very strange for a psi that evolved for split-second decisions to also be able to predict the far future (also, this would run into paradoxes). Another possibility might be that psi is broken in modern humans for some reason (increased brain size? lifestyle?). But I don’t know of any animal psi results that are well-replicated.

      These difficulties magnify when you remember that psi seems to be a whole collection of abilities that would each require different exotic physics. As weird as it would be to have invisible organ-less non-adaptive-fitness-providing mechanisms for one, that multiplies when you need precognition AND telepathy AND telekinesis.

      • Johann says:

        Well, Scott, you bring up some good points. Thank you for the swift and gracious reply. I will address your argument about Bayesian priors first.

        The way I see it, there is something strange about your approach, and that is that if you use a two-prior system, the experimental evidence becomes essentially superfluous—and however logical the grounding for it may be, if this is the case we must be led to seriously question it. Consider: no matter how many meta-analyses we conduct, or how many experiments, if your priors are at 1/10 for flaws and 1/1000 for psi (a reasonable psi prior, BTW), you will always accept the flaw hypothesis. Not to put too fine a point on it, but if we were randomly selecting from infinite possible worlds where we conduct a meta-analysis, we would always have a 100 times greater chance of selecting a world with flaws than with psi, and thus also a 99.0% chance of being in the world with flaws (given a positive MA) and a .99% chance of being in the world with psi. What’s remarkable to me is that it wouldn’t matter how many positive MAs we got, sequentially; these probabilities would hold steady!

        My concern is that the two-prior system is a self-fulfilling prophecy, and not how Bayesian ideas should actually play out. But that’s not to say it’s unreasonable, in certain contexts, to think this way. We all know human beings have only a limited time to pursue what interests them; from the perspective of a person trying to make rational decisions about what to pursue and what to avoid—for whom parapsychology is just one of a mass of unreliable claims—it makes sense to default to the flaw hypothesis. However, for someone who has decided that the field is worth a more intense form of scrutiny, this analysis cannot be acceptable. I have the burden of proof in this discussion; it is my job to persuade you that parapsychology is worth the latter, and not the former, treatment.

        Even a flaws prior based on empirical estimates of problematic meta-analyses is still only the crudest approximation to the “true” error prior of any MA; what it does is use the rough general quality of a scientific discipline (in this case all meta-analyses!) to predetermine the amount of evidence in one example—again, a perfectly reasonable approach for estimating the likelihood of success, generally, or of any particular meta-analysis one doesn’t want to invest time in, but rather superficial for a question one wants to have an accurate answer to. I don’t know about you, but I want to have a superbly accurate answer to whether psi exists—it’s of great importance to me, and I will be satisfied with nothing less.

        To the genuinely moved investigator, I think, the first step is to undertake a considerable analysis of the experimental and statistical methodology. The flaws prior must disappear, to be replaced with the condition that if any flaw is found which the investigator deems influential, the resulting p-value of the evidence (or Bayes factor) must be left in serious question until such time as it can be shown that (1) the flaw did not exist or (2) it could not have significantly impacted the data.

        • Scott Alexander says:

          I don’t think it’s as pessimistic as you make it sound.

          In theory, if this meta-analysis dectupled my probability to 1%, the next one that comes out positive should dectuple my probability to (almost) 10%, and so on.

          In practice this doesn’t work because I assume that, if this meta-analysis is systematically flawed, the next one shares those same flaws. But you could raise my probability by showing it doesn’t. For example, if the next meta-analysis fixes the problem I mention above about lack of conclusive evidence in peer-reviewed exact replications, that should raise my probability further, and so on.
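
          To illustrate the idealized case: pretend each new meta-analysis were genuinely independent and worth a Bayes factor of 10 (a made-up number, just to show the shape of the updating):

              # Idealized sequential updating on independent positive
              # meta-analyses, each with an assumed Bayes factor of 10.
              prior_prob = 1 / 1000
              odds = prior_prob / (1 - prior_prob)
              bayes_factor = 10  # assumed, for illustration only

              for n in range(1, 5):
                  odds *= bayes_factor
                  prob = odds / (1 + odds)
                  print(f"after meta-analysis {n}: P(psi) = {prob:.3f}")
              # -> 0.010, 0.091, 0.500, 0.909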

          You ask below what evidence I would find conclusive. I don’t think I would find any single study conclusive. But if some protocol was invented that consistently showed a large positive effect size for psi, and it was replicated dozens of times with few or no negative replication attempts, and it was even replicated when investigated by skeptics like Wiseman, I think that would be enough to make me believe.

        • Johann says:

          The problem with the two-prior system is that even if the flaws prior varies between meta-analyses—which I agree is better than keeping it fixed—it still has the effect of turning the Bayes factor of any particular meta-analysis into little more than a large constant by which to multiply your predetermined priors, leaving the same relative odds for your hypotheses at the end of the analysis as at the beginning.

          Essentially, the approach seems to me to be slapping numbers onto what people already do: disregard the Bayes Factor or the p-value if they think flaws are more likely than the result of the MA.

          All I’m saying is that determining that last part—the likelihood that flaws are really a more reasonable explanation—requires an exhaustive, self-driven inspection, if one genuinely wants to avoid a Type II error as well as Type I error.

          I also think you may have misread the results of published exact replications, because you left out:

          Bem (2012)
          N=42
          ES=0.290

          Savva et al. (2004)
          N=25
          ES=0.34

          Savva et al. Study 1 (2005)
          N=50
          ES=0.288

          Savva et al. Study 2 (2005)
          N=92
          ES=-0.058

          Savva’s studies can be listed as exact replications of Bem, without precognition, because they stated explicitly in their papers that they were replicating Bem (2003), the first series of habituation experiments in Bem (2011) which Bem presented at a PA convention 11 years ago. Bem (2012), also, counts as a legitimate prospective replication of Bem (2011), IMO.

          If you factor in these missing studies, you now have 6/11 positively directional results, five of which display effect sizes from .27 to .34, and only two of which display negative effect sizes of comparable magnitude. Just eyeballing it, I’m willing to bet that a combination of these ES values would result in an aggregate ES of around 0.09, the reported value for exact replications of Bem in general.

        • Johann says:

          I’m also interested in hearing your perspective on the fact that not only was there a general overall result in Bem et al. (2014), but also several observations that run directly counter to what we might predict in the absence of an effect. I’d like to see what you think about the strength of such observations as evidence against the experimenter error/manipulation hypothesis.

          For example:

          (1) Consider that the much talked about “experimenter degrees of freedom” would be expected to more strongly impact conceptual rather than exact replications—hardly a controversial point—yet the ES of the conceptual experiments is lower than that of the exact. Indeed, exact replications have the highest ES of the batch.

          Also, if you read Table 1 carefully, as well as the paragraph before it, you will see that all the experiments listed under “Independent Replications of Bem’s experiments” exclude Bem’s original findings, including the “exact replications”, so the point you make above in your essay that he counted those in the analysis is, I believe, mistaken.

          (2) There were very noticeable differences between fast-thinking and slow-thinking experiments, and no obvious reason, on the experimenter manipulation hypothesis, why this should be so. In particular, every single fast-thinking category yielded a p-value of less than .003, but both types of slow-thinking experiments yielded p-values above .10. Bem gives a good explanation of this in terms of the psi hypothesis, pointing out that online experiments seemed to have very strongly hampered the ES of slow-thinking protocols. It seems to me the experimenter error/manipulation hypothesis would be at a loss to account for this; why should it be that slow-thinking protocols or online administration of these protocols lower the incidence of bias?

          I think the safeguards of these experiments pretty much rule out sensory cues; therefore the skeptical explanation must lie in p-hacking, multiple analysis, experimenter degrees of freedom, selective outcome reporting, etc. However, (1) seems inconsistent with most of this, the check for p-hacking failed to find a result, and (2) decisively refutes the prediction that such biases evenly affected fast and slow-thinking protocols—the most straightforward prediction we would have made prior to seeing the results.

          I’m not claiming any of this is conclusive, but I am saying that when you think carefully about some of the results in this MA, they take you by surprise. This is the kind of ancillary data that contributes to the veracity of an effect as much or more than the basic overall effect measure, IMO; if these types of suggestive trends weren’t present throughout parapsychology databases, I would find them less convincing.

        • Johann says:

          BTW, Max just told me that he did the sample-size weighted mean calculation for the ES values I report plus the ones you report: the result is 0.0802. This exactly confirms what I said in my above post, and refutes the contention that published exact replications of Bem’s studies fail to replicate the results of the 31 reported exact replications, published or not, in Bem et al. (2014).

          Max also bets there is heterogeneity across these ES values, given their extreme variance; an I^2 test on them might be appropriate. If moderate to great heterogeneity was found, it would count as further evidence against the null.
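
          For anyone who wants to check the arithmetic, here is a rough sketch of a sample-size weighted mean over just the four effect sizes I listed above (Max’s 0.0802 also folds in the values Scott reported, which I have not reproduced here, so this alone won’t match his figure):

              # Sample-size weighted mean ES for the four studies listed above.
              studies = [
                  ("Bem (2012)",             42,  0.290),
                  ("Savva et al. (2004)",    25,  0.340),
                  ("Savva et al. (2005) S1", 50,  0.288),
                  ("Savva et al. (2005) S2", 92, -0.058),
              ]
              total_n = sum(n for _, n, _ in studies)
              weighted_es = sum(n * es for _, n, es in studies) / total_n
              print(f"N = {total_n}, weighted mean ES = {weighted_es:.3f}")
              # -> roughly 0.14 for these four alone; a proper meta-analytic
              # estimate would use inverse-variance weights and a Q/I^2
              # heterogeneity check rather than raw sample sizes.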

      • Johann says:

        I don’t have time to address the physics/biology argument in detail now, but consider just these few points:

        (1) psi is taken by many parapsychologists to be nothing more than an artifact of retrocausality; under the DATS model of psi, which makes testable predictions, psi is just the collateral of a conscious process that slips backwards and forwards in time (we don’t need to factor in long-term precognition right now, since the evidence for that is mostly anecdotal). See: http://www.lfr.org/lfr/csl/library/DATjp.pdf

        EDIT: Retrocausality can explain, BTW, the results of probably most, if not all, psi experiments. It is also one of the ways entanglement has been theorized to work, and it is entailed by the TSQM model of quantum mechanics. Bell’s theorem, if I recall, establishes a non-local, anti-realist, OR retrocausal universe. Most physicists opt for non-locality, but I think the experimental evidence from the physicists I mentioned, as well as from parapsychology, should prompt us to examine the retrocausality option more carefully.

        For a fascinating series of physiological psi experiments which complement Bem’s and offer evidence for presentiment, where another comprehensive meta-analysis has also been published, see: http://journal.frontiersin.org/Journal/10.3389/fpsyg.2012.00390/pdf

        The effect size of the above experiments is considerably larger than Bem’s, on the order of the ganzfeld findings.

        (2) It is true that we haven’t found an obvious organ associated with the “psi sense”, but it is also true that the human body has a number of senses beyond the five—precisely how many is constantly in debate—that don’t have organs as clear as the eyes, for example.

        (3) The brain has been rightly termed the most complex object in the known universe; so complex it contains the mystery of conscious experience—the “Hard Problem”—which has bewildered neuroscientists and philosophers for centuries.

        When it comes to spooky physics, consider that we have predicted entanglement to occur in the eyes of birds, with empirical and theoretical evidence (still in debate), as well as in the photosynthesis of algae, and also quantum tunneling in proteins across the biological spectrum; if this is the case, imagine the level of spooky physics that might take place in a human brain.

      • Johann says:

        A final question for now: what level of evidence would convince you that some form of psi exists? What sort of experiment, under what conditions?

      • Troy says:

        There’s also the question of how we could spend so much evolutionary effort exploiting weird physics to evolve a faculty that doesn’t even really work. I don’t think any parapsychologist has found that psi increases our ability to guess things over chance more than five to ten percent. And even that’s only in very very unnatural conditions like the ganzfeld. The average person doesn’t seem to derive any advantage from psi in their ordinary lives.

        One of the most common “psychic” anecdotes that I hear is some variation of the following story: I had a sense that something was wrong with my Aunt Bea, so I picked up the phone and called, and my brother answered and said that she just had a heart attack. I think such anecdotes are especially common among twins and other close relatives.

        Let’s assume that there’s actually some kind of telepathy going on here and that it’s either explained by genetic similarities or close emotional connections. Either way, it seems plausible that being able to (even highly fallibly) tell when your close genetic relative or emotional companion is in trouble could be of significant evolutionary advantage.

        • gwern says:

          Let’s assume that there’s actually some kind of telepathy going on here and that it’s either explained by genetic similarities or close emotional connections. Either way, it seems plausible that being able to (even highly fallibly) tell when your close genetic relative or emotional companion is in trouble could be of significant evolutionary advantage.

          Nice thing about the evolutionary theory is that it suggests quite a few testable predictions: since it seems clear that psi, if it exists, varies between people, if it’s genetically based we should see the usual factor results from sibling/fraternal-twin/identical-twin studies; we should see dramatic increases in psi strength when we pair related or unrelated people in ganzfeld or staring studies (presumably identical-twin and then parent-child bonds are strongest, but we might expect subtler effects like stronger dad-son & mom-daughter communication, weaker communication among people related by adoption etc), we should be able to show decreases in communication between couples who were linked in 1 session but had broken up by the next session, and so on. You can probably think of more implications. (Hm… would we expect people who grew up in dangerous places or countries, and so whose relatives/close-ones would be more likely to be at risk, to have greater receptiveness?)

          On the other hand, when we try to put it in evolutionary terms as kin selection, it casts a lot of doubt on the hypothesis, since the benefit doesn’t seem to be big and so selection would be weak. I’m not a population genetics expert, but I’ll try to do some estimating…

          The benefit: remember the quip – you would sacrifice yourself for 2 siblings, 4 nephews, 8 cousins… How often does one feel worried about Aunt Bea? And how often is Aunt Bea actually in trouble? (It’s no good for the psi sense if it spits out false negatives or positives, it only helps when it generates a true positive and alerts you when the relative is in danger.) Speaking for myself, I’ve never had that experience. Even the people who generate such anecdotes don’t seem to experience such events more than a few times in a lifetime. Imagine that you’re alerted, say, 3 or 4 times in a lifetime, a quarter are correct, and you have even odds of saving their lives single-handedly, and also that they’re still young enough to reproduce, and you costlessly wind up saving Aunt Bea’s life. You’re related by a quarter to Aunt Bea, I think (half comes from mother, mother will be sibling-related to Aunt Bea, so 0.5 * 0.5 = 0.25, right?), so your inclusive fitness gain here is 0.25 * (1/4) * (1/2) = +0.03125. I think these are generous figures and the true s is a lot lower, but let’s go with it.

          So let’s say that if psi were a single mutation rather than a whole bunch, it has a selective advantage of 0.03125. An advantage doesn’t guarantee that the mutation will become widespread through the population, since the original bearers can just get unlucky before reproducing much. One good approximation to the probability of fixation is apparently simply doubling the selective advantage (π≈2s; http://rsif.royalsocietypublishing.org/content/5/28/1279.long), so the probability of fixation here is about 6%.
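
          Spelled out in code, with the same generous guesses:

              # Back-of-the-envelope inclusive-fitness benefit of an
              # "Aunt Bea alarm", using the generous guesses above.
              relatedness = 0.25    # niece/nephew to aunt: 0.5 * 0.5
              p_alert_true = 0.25   # a quarter of lifetime alerts are real
              p_save_life = 0.5     # even odds the warning saves her

              s = relatedness * p_alert_true * p_save_life
              p_fixation = 2 * s    # classic approximation, pi ~= 2s

              print(f"selective advantage s ~= {s:.5f}")   # ~0.03125
              print(f"P(fixation) ~= {p_fixation:.3f}")    # ~0.06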

          So if such a mutation were to ever happen, it’s highly unlikely that it would then spread.

        • endoself says:

          If it appears once it is unlikely to spread. Many mutations that increase fitness by 3% have reached fixation, since the same mutation, or different mutations with the same effects, can occur many times given a large enough population and enough time.

          • gwern says:

            Yes, but the question of how often a psi single mutation would arise leads us to the question of whether that’s remotely plausible (no), hence it must be a whole assemblage of related pieces, and that leads us to the question of how psi could possibly work, much less evolve incrementally; and I’d rather not get into that just to make my point that the fitness benefit & probability of fixation can’t be very big.

      • Anonymous says:

        This experiment produced results that could only possibly happen if either psi existed or the experiment was flawed, so we should increase both probabilities. However, how *much* we increase them depends on the ratio of our priors.

        That’s patently wrong. It is fundamental to Bayesian inference that the amount by which our prior odds change in response to new evidence is independent of those prior odds:

        Posterior odds = Bayes factor × Prior odds,

        where the Bayes factor, the ratio of the marginal likelihoods of the two hypotheses we are considering, quantifies the relative weight of the evidence for those two hypotheses.

        Suppose that before hearing this, we thought there was a 1/10 chance of any given meta-analysis being flawed (even one as rigorous as this one), and a 1/1000 chance of psi existing.

        Now we get a meta-analysis saying psi exists. For the sake of simplicity let’s ignore its p-value for now and just say it 100% proves its point.

        In 1000 worlds where someone does a meta-analysis on psi, 100 will have the meta-analysis be flawed and 1 will have psi exist.

        The results of this study show we’re in either the 100 or the 1. So our probabilities should now be:

        1/101 = ~0.99% chance psi exists
        100/101 = ~99.0% chance the meta-analysis is flawed.

        The new evidence increases your odds of the psi vs the non-psi hypothesis; thus it is evidence in favor of the psi hypothesis relative to the non-psi hypothesis. It is fundamental to Bayesian inference that if enough such evidence accumulates, the probability of the psi hypothesis must approach 1 in the limit. However, if you continue to update your odds in the manner you describe, your probability of psi can never exceed 1/2. Thus, no amount of such evidence could ever convince you of the psi hypothesis, in contradiction of a fundamental Bayesian law.

    • Jesper Östman says:

      Interesting points. I looked for 4 studies by Wiseman and Schlitz but could only find 3. What is the study after Wiseman and Schlitz 1997?

      (She only mentions 3 as of February 2013 http://marilynschlitz.com/experimenter-effects-and-replication-in-psi-research/ , and I have failed to find other Wiseman and Schlitz papers when googling a bit)

      • Johann says:

        Prior to their collaboration, Wiseman and Schlitz had both carried out staring studies with null results, which prompted their joint experiment. These were Wiseman & Smith (1994), Wiseman et al. (1995), and Schlitz & LaBerge (1994). I count these as semi-evidential, kind of like preliminary studies: they offered results suggestive of an experimenter effect, which was then reproduced under more rigorous conditions.

        I missed one of those studies though, so the revised count should be Schlitz: 3/4 and Wiseman: 0/5; for Schlitz, three of those (2 successes out of 3) were part of the collaboration, and for Wiseman, three (3 failures out of 3) were as well.

        Altogether, given that every success mentioned is an independently significant experiment, the statistical fluke hypothesis is unlikely.

        • Johann says:

          * I say “carried out studies with null results” above, but what I really meant was Wiseman got null results and Schlitz got positive results.

  30. Johann says:

    The following popular article in Nature mentions a few examples:

    http://www.nature.com/news/2011/110615/pdf/474272a.pdf

    There is a decent talk on the subject by physicist Jim Al-Khalili, at the Royal Academy, unconnected with parapsychology (he’s got a great bow-tie, though):

    https://www.youtube.com/watch?v=wwgQVZju1ZM

    ^ If the above link doesn’t show, type “Jim Al-Khalili – Quantum Life: How Physics Can Revolutionise Biology” into Youtube instead.

    There are also a number of references:

    Sarovar, Mohan; Ishizaki, Akihito; Fleming, Graham R.; Whaley, K. Birgitta (2010). “Quantum entanglement in photosynthetic light-harvesting complexes”. Nature Physics.

    Engel, G. S.; Calhoun, T. R.; Read, E. L.; Ahn, T. K.; Mancal, T.; Cheng, Y. C.; et al. (2007). “Evidence for wavelike energy transfer through quantum coherence in photosynthetic systems”. Nature 446 (7137): 782–6.

    “Discovery of quantum vibrations in ‘microtubules’ inside brain neurons supports controversial theory of consciousness”. ScienceDaily. Retrieved 2014-02-22.

    Gauger, Erik M.; Rieper, Elisabeth; Morton, John J. L.; Benjamin, Simon C.; Vedral, Vlatko (2011). “Sustained quantum coherence and entanglement in the avian compass”. Physical Review Letters 106 (4): 040503.

    Kominis, Iannis (2009). “Quantum Zeno effect explains magnetic-sensitive radical-ion-pair reactions”. Physical Review E 80: 056115.

    You can check Wikipedia, if you like, as well; it has those references and a little bit of information.

  31. Christian says:

    This is why scientists should be humble and embrace constructivism and second-order cybernetics when they write papers.

  32. Troy says:

    Are there any surveys of what percentage of professional psychologists (or other relevant scientists) believe in psi (or think the evidence for it is strong enough to take it seriously)? Presumably said survey would have to be anonymous to get reliable results, since believers might be embarrassed to say so publicly.

    • Johann says:

      That’s an excellent question, actually, and the answer is yes.

      Wagner & Monnet (1979) and Evans (1973) both privately polled populations of scientists, technologists, and academics, and found that between 60-70% of them agreed with the statement that psi is either a “proven fact or a likely possibility” (response bias and other confounding variables exist, though). Consistently low results have been found for belief among psychologists; in Wagner & Monnet (1979), psychologists who thought psi was either “an established fact or a likely possibility” were just 36% of the total, compared to 55% of natural scientists and 65% of social scientists.

      When it comes to the scientific elite, however, it is another matter. Here, evidence from McClenon (1982) seems to point to unambiguous skeptical dominance, with less than 30% of AAAS leaders holding to the likelihood of psi. Still, 30% is a lot of scientists—especially given these are the board members of the AAAS, the largest general scientific society in the world. Add to that the fact that the Parapsychological Association is an affiliate member of the AAAS—despite a vigorous campaign to remove them in 1979—and you have an interesting situation.

      We must keep in mind that while both of these pieces of information are interesting, they shouldn’t do much to sway our judgement. In both surveys, it is difficult to gauge to what extent the opinions of those polled were formed in response to the empirical evidence.

      You may find the Wagner & Monnet (1979) results here: http://www.tricksterbook.com/truzzi/…ScholarNo5.pdf

      A larger body of results is reviewed here: http://en.wikademia.org/Surveys_of_academic_opinion_regarding_parapsychology

  33. Pingback: Links For May 2014 | Slate Star Codex

  34. Pingback: The motto of th… | On stuff.

  35. Ilya Shpitser says:

    People like to mock Less Wrong, saying we’re amateurs getting all starry-eyed about Bayesian statistics even while real hard-headed researchers who have been experts in them for years understand both their uses and their limitations. Well, maybe that’s true of some researchers. But the particular ones I see talking about Bayes here could do with reading the Sequences.

    So what would be your recommendation to an endless list of people from LW, from EY on down, who say things about B/F that are (a) wrong, or (b) not even wrong? Could they do with reading a textbook?

    If I had to choose between the LW cohort and the stats (or even data analyst) cohort as to who had generally better calibrated beliefs about stats issues, I know who I would go with.

    • Yes, Ilya Shpitser! I am a mere statistician and data analyst, doubter of Jonah Lehrer’s veracity, ignorantly idolatrous in my continued use of Neyman, Pearson and Fisher. I love validation.

      I recognize your name. You had a lively, cordial conversation with jsteinhardt on LW, following his Fervent Defense of Frequentist Statistics. I smiled with delight as I read of your commitment there.

  36. Pingback: Utopian Science | Slate Star Codex

  37. Allan Crossman says:

    Just for the record, I didn’t invent the term “control group for science”, I think that was probably Michael Vassar.

    • Allan Crossman,
      True or not, the term “control group for science” is attributed to you, near and far, all over the internet. The origin seems to be consistent with your (commendably modest, honest) denial, per Douglas Knight’s comments on She Blinded Me With Science, 4 Aug 2009:
      “I think I’ve heard the line about parapsychology as a joke in a number of places, but I heard it seriously from Vassar.”
      EY replies, and the thread ends with yet others, e.g. “Parapsychology: The control group for science. Excellent quote. May I steal it?” and “It’s too good to ask permission for. I’ll wait to get forgiveness ;).”

      On 05 December 2009, you wrote “Parapsychology: the control group for science.” I could find no other, better sources online attributing it to you or Vassar. Actually, none directly to him, only you.
      Eek! I need to put my time to better use. This is embarrassing!

  38. For the author, Mr. Scott Alexander,
    Placing meta-analyses at the pinnacle of your Pyramid of Scientific Evidence is incorrect. As a practicing frequentist statistician, I am certain. Also, this is one of the few times that I actually agree with Eliezer Yudkowsky! He commented on your post. Substitute “frequentist” for Bayes, and vice-versa, in his comment. The conclusion is the same, in my informed opinion: meta-analyses are less, rather than more, ah, robust, compared to some of the other pyramid levels.

    I mention this with good intent (it isn’t like a tiny missing word). You said,

    There is broad agreement among the most intelligent voices I read (1, 2, 3, 4, 5) about a couple of promising directions we could go.

    No, noooo! Number 3 is notorious science fraud, Jonah Lehrer. Lehrer acknowledged that he fabricated or plagiarized everything. He even gave a lecture about it at a prominent journalism school, maybe Columbia or Knight or NYU, last year, after being found out. You should probably re-think whether you want to cite him as one of the most intelligent voices you read.

    Unlike most critiques of statistical analysis, yours does contain a core of truth!

    People are terrible. If you let people debate things, they will do it forever, come up with horrible ideas, get them entrenched, play politics with them, and finally reach the point where they’re coming up with theories why people who disagree with them are probably secretly in the pay of the Devil.

    I enjoyed that, very much.

    • Scott Alexander says:

      I think you’re misunderstanding. I am posting the standard, internationally accepted “pyramid of scientific evidence”, and then criticizing it. I didn’t invent that pyramid and I don’t endorse it.

      Jonah Lehrer is indeed a plagiarist. He’s also smart and right about a lot of things. Or maybe the people whom he plagiarizes are smart and right about a lot of things. I don’t know. In either case, the source doesn’t spoil the insight, nor does that article say much different from any of the others.

      • I regret being unclear. I meant that I agreed with this, and only this, in EYudkowsky’s comment earlier:

        …meta-analyses will go on being [bullshXt]. They are not the highest level of the scientific pyramid…When I read about a new meta-analysis I mostly roll my eyes.

        Me too!

        I don’t know what this is about,
        “You can’t multiply a bunch of likelihood functions and get what a real Bayesian would consider zero everywhere, and from this extract a verdict by the dark magic of frequentist statistics.”

        When I make my magical midnight invocations to the dark deities of frequentist statistics, open my heart and mind to the spirits of Neyman, Pearson and Fisher, I work with maximum likelihood estimates (MLE’s), not “likelihood functions”. There are naive Bayes models and MLE for expectation maximization algos [PDF!], but I don’t know if EY had that in mind.

        You’ll lose credibility if you continue to claim that Jonah Lehrer is among the most intelligent voices you read. That is, of course, entirely your prerogative. I only wanted to be friendly and helpful.

    • gwern says:

      No, noooo! Number 3 is notorious science fraud, Jonah Lehrer. Lehrer acknowledged that he fabricated or plagiarized everything. He even gave a lecture about it at a prominent journalism school, maybe Columbia or Knight or NYU, last year, after being found out. You should probably re-think whether you want to cite him as one of the most intelligent voices you read.

      He acknowledged plagiarizing some things (mostly things I’d regard as fairly trivial and common journalistic sins of simplifying & overstating), but if he plagiarized ‘everything’ I will be extremely impressed. I don’t recall anyone raising doubts about his ‘Decline’ article, involved people commented favorably on the factual aspects of it when it came out, the NYer still has it up, and my own reading on the topics has not led me to the conclusion that Lehrer packed his decline article with lies, to say the least. If you want to criticize use of that article, you’ll need to do better.

      • Another online acquaintance: Gwern of Disqus comments, who has found (sometimes-amusing) fault with my comments on inane The Atlantic posts.

        So. You like writing about Haskell, the Volokh Conspiracy, bitcoin and the effectiveness of terrorism. Goldman Sachs has not been extant for 300 years. I was saddened by your blithe dismissal of Cantor-Fitzgerald, post-9/11. I worked, briefly, for Yamaichi Securities, on floor 98 of Tower 2, but several years after the 1993 WTC explosion.

        Please consider dropping by for a visit on any of my Wikipedia talk pages. You have 7 years’ seniority to me there. David Gerard is a decent person. He wrote your theme song, the mp3, so you must have some redemptive character traits :o) I am FeralOink, a commoner.

  39. Anonymous says:

    @johann:

    it still has the effect of turning the Bayes factor of any particular meta-analysis into little more than a large constant by which to multiply your predetermined priors, leaving the same relative odds for your hypotheses at the end of the analysis as at the beginning.

    Last time I checked, multiplying a positive number by a large constant resulted in a larger number. You need a review of Bayes factors, multiplication, or both.

    • Johann says:

      I don’t think you understand me clearly: the main utility of Bayesian statistics is that we can update a prior to a posterior, by multiplication with a Bayes factor. On a calculational level, this is all that happens, and indeed Scott’s approach doesn’t break from this. However, when it comes to the actual inference, what should happen is more than this; ideally, we should allow our beliefs to be guided by what those numbers actually represent. Because Scott uses two priors, however, the relative odds of his two competing hypotheses (i.e. there is a true effect and there is not) remains the same before and after any particular statistical test of the evidence. Something about this is just not right, IMO.

      In practice, I know that it is nonsense to believe what the numbers literally say, without skepticism. People should only take statistics at face value, for areas that really intrigue them, after having satisfied themselves thoroughly that the experiments under analysis are not explainable on the basis of flaws. But after this has occurred, those numbers have real meaning!

      • Anonymous says:

        @Johann:

        I don’t think you understand me clearly

        Actually your new post confirms that I did understand you, and that you don’t understand how Bayesian updating works.

        [W]e can update a prior to a posterior, by multiplication with a Bayes factor. On a calculational level, this is all that happens, and indeed Scott’s approach doesn’t break from this.

        In fact the approach Scott proposed for updating his probabilities was dead wrong, because he made the contribution from the new evidence depend on the prior, which violates one tenet of Bayesian inference; and using his method of updating, his posterior probability for psi can never exceed 1/2, which violates another tenet of Bayesian inference.

        Because Scott uses two priors, however, the relative odds of his two competing hypotheses (i.e. there is a true effect and there is not) remains the same before and after any particular statistical test of the evidence. Something about this is just not right, IMO.

        Bayesian inference always considers (at least) two hypotheses. Often, the second hypothesis is the complement of the first, but this need not be the case. It is perfectly fine to consider the prior odds of an observed effect being due to psi (H1) vs being due to experimental bias (H2). The prior odds is a ratio of two probabilities, P(H1)/P(H2), and is hence a single non-negative number. This number is multiplied by the Bayes factor, which is the ratio of the probability of the data under the psi hypothesis to the probability of the data under the bias hypothesis, and is hence also a non-negative number. Unless this number is 1, or the prior odds are 0 or infinity, multiplying the prior odds by the Bayes factor will result in posterior odds that are different from the prior odds. Clearly, the odds will not remain the same, as you claim.

        • Johann says:

          It may well be that I am technically mistaken in my analysis—I have not deeply studied Bayesian hypothesis testing—but my impression is still that we’re not actually disagreeing on much, although now I have a concern or two about your approach as well. I would be glad to be corrected on any mistake, BTW.

          Firstly, I will be as clear as possible about what I mean. I see Scott’s approach as one that conducts two separate hypothesis tests, both correctly performed. My contention, though, is that it is fundamentally wrong to do *both*. It is clear to me that Scott is juggling four subtle hypotheses, when really he’s only interested in two, to start with: psi exists vs. it does not, and invalidating flaws exist vs. they do not. He sets his prior for flaws at 1/10 (implying a prior of 9/10 for no invalidating flaws) and his prior for psi at 1/1000 (implying a prior of 999/1000 for no psi), multiplies both of them by the Bayes factor of a study, let’s say 300, and obtains two posterior distributions, let’s say 30 to 1 for flaws vs. no flaws and 3 to 10 for psi vs. no psi.

          Now, since the existence of invalidating flaws effectively begets the same conclusion as the non-existence of psi, we can make the following comparison: At the start of the test, the ratio of Scott’s two priors was (1/10)/(1/1000) = 100/1, implying that he favored the flaws hypothesis a hundred times more than the psi hypothesis. Now, the ratio for his posteriors is (30/1)/(3/10) = 100/1, so it is clear that nothing has changed and the very performance of the test was meaningless. If you believe there are likely to be flaws in a study, why update the numbers?

          If you saw something different in Scott’s methodology, feel free to explain.

        • Anonymous says:

          @Johann:

          It is clear to me that Scott is juggling four subtle hypotheses, when really he’s only interested in two, to start with: psi exists vs. it does not, and invalidating flaws exist vs. they do not.

          No. There are only two hypotheses under consideration: H1: Results of experiments purporting to show psi are actually due to psi; H2: Such results are due to bias in the experiments.

          He sets his prior for flaws at 1/10 (implying a prior of 9/10 for no invalidating flaws) and his prior for psi at 1/1000 (implying a prior of 999/1000 for no psi)…

          The two hypotheses quoted above that are complementary to H1 and H2 (ie, the material you have parenthesized) do not enter into the analysis.

          [He] multiplies both of them by the Bayes factor of a study, let’s say 300, and obtains two posterior distributions, let’s say 30 to 1 for flaws vs. no flaws and 3 to 10 for psi vs. no psi.

          No. It works like this: We start with the prior odds of H1 vs H2:

          Prior odds = P(H1)/P(H2) = .001/.1 = .01.

          We multiply the prior odds by the Bayes factor for H1 vs H2. If D stands for our observed data (in this case, the results of the experiments in the meta-analysis), then

          Bayes factor = P(D|H1)/P(D|H2) = 300.

          And, by the odds form of Bayes’ theorem, we multiply the prior odds by the Bayes factor to obtain the posterior odds of H1 vs H2:

          Posterior odds = P(H1|D)/P(H2|D) = 300 × .01 = 3.

          So, prior to observing the data, we believed that results purporting to show psi were 100 times as likely to be due to bias as to psi. After observing the data, we believe that such results are 3 times as likely to be due to psi as to bias. Our new observations have increased our belief that the results are due to psi relative to our belief that they are due to bias by a factor of 300.

          Hopefully that makes sense to you.
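
          In code, with the same illustrative numbers:

              # Odds form of Bayes' theorem with the numbers above.
              p_h1 = 0.001        # prior P(results are due to psi)
              p_h2 = 0.1          # prior P(results are due to experimental bias)
              bayes_factor = 300  # P(D|H1) / P(D|H2), taken as given above

              prior_odds = p_h1 / p_h2            # 0.01: bias favored 100:1
              posterior_odds = bayes_factor * prior_odds

              print(f"prior odds (psi:bias)     = {prior_odds:.2f}")
              print(f"posterior odds (psi:bias) = {posterior_odds:.1f}")
              # The posterior odds differ from the prior odds by exactly
              # the Bayes factor, whatever the prior odds happen to be.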

  40. Hi Scott,

    I must say that I am sort of a “believer in psi.” Also, I have read quite a lot about it over the last ten years, and I have been following to a certain extent the psi-believers vs psi-skeptics debate (on the web, in papers, in books, like “Psi Wars: Getting to Grips with the Paranormal,” etc). Further, I have, on some occasions, taken sides rather fiercely on this issue. Yet, I must acknowledge that many critiques of psi works are of high value. And I did find your evaluation (article) above very interesting and worthy of respect. I would like to comment on a few points:

    “None of these five techniques even touch poor experimental technique – or confounding, or whatever you want to call it. If an experiment is confounded, if it produces a strong signal even when its experimental hypothesis is false, then using a larger sample size will just make that signal even stronger.” … … “Replicating it will just reproduce the confounded results again.”

    I believe that, if only confounding were involved in the issue, there actually would be a drift in the results, and not confirmation plus confirmation plus confirmation. It would take more than poor standards to veer the results in one direction: it would take some sort of fraud, whether conscious or unconscious.

    And the results of Wiseman and Schlitz’s work were interesting. It is curious that it was not heavily replicated (it might be interesting to find out whether that was mostly because of the believers or because of the skeptics…).

    I would just like to add, as gentlemanly as possible (and thus honour the high level of the debate on this page), that I do not share your view regarding Wiseman. But that is not the issue, anyway.

    I also think that Johann made very good contributions to the debate on this page. It is nice that he provided a fair amount of information about quantum-mechanics-based biological phenomena, which is an area of knowledge that has been growing considerably (and robustly) over the last ten years.

    Julio Siqueira
    http://www.criticandokardec.com.br/criticizingskepticism.htm

  41. Pingback: The Bem Precognition Meta-Analysis Vs. Those Wacky Skeptics | The Weiler Psi

  42. Adam Safron says:

    For the Wiseman & Schlitz staring study (or “the stink-eye ‘effect’” as I like to call it), although I haven’t looked closely, I think I might have an explanation for how different results could be obtained with “identical” methods. It wasn’t a double-blind study. Although the person receiving the stink-eye was unaware of when they were being stared at, the person generating the stink was able to monitor the micro-expressions of the participants, and so be influenced in when they ‘chose’ to commence with stink-generation. Under this account, there is a causal relationship, but it goes in the reverse direction, and isn’t mediated by anything spooky, except for the spookiness of the exquisite pattern-detecting abilities of brains.

    • Kibber says:

      At least in the 1997 paper that I looked at, the experimenters used randomly generated sequences of stare and non-stare periods – i.e. the decision to stare or not was truly random and not at-will.

  43. hughw says:

    The analogy of the meta-experiment to a control using a placebo is slightly wrong. In giving a subject the placebo, you are causing him to believe it might work. He did not enter the experiment believing it would work. The parapsychologists, by contrast, all enter the experiment already believing parapsychology is real.

    • he who posts slowly says:

      Parapsychologists are not distinguished by the property of believing their hypothesis is correct.

      • hughw says:

        It’s a premise of this essay. “…the study of psychic phenomena – which most reasonable people don’t believe exists but which a community of practicing scientists does and publishes papers on all the time…. I predict people who believe in parapsychology are more likely to conduct parapsychology experiments than skeptics”

        • Hi Hughw,

          I think you are oversimplifying the issue. And, as to the premise (or one of the premises) of this essay, bear in mind that Scott said *most* reasonable people do not believe in psi. He did not say that *all* reasonable people do not believe in psi. Further, it is said above (in your quote) that the *community* believes in psi. But it is not said that *all* the parapsychologists believe in psi.

          Even though we all do not believe in God, Angels, and Demons, the Devil is still in the details…

          Best,
          Julio Siqueira
          http://www.criticandokardec.com.br/criticizingskepticism.htm

        • Anonymous says:

          Psychologists believe in their hypothesis, just like parapsychologists believe in theirs.

  44. >Imagine the global warming debate, but you couldn’t appeal to scientific consensus or statistics because you didn’t really understand the science or the statistics, and you just had to take some people who claimed to know what was going on at their verba.

    You say this immediately after spending 3 sections proving that even in our world, statistics and consensus don’t actually work, but then don’t mention it in this context even to lampshade it.

    There is no way this is accidental, because I know you read Jim’s blog, and his influence on this post is quite apparent, and he makes that argument all the time.

    I’ve noticed this habit you have before where you bust out some extremely interesting argument and then fail to even lampshade the obvious implication. It’s not plausible that it’s an accident, but it’s also too weird for it to be deliberate. I’m confused.

    • Douglas Knight says:

      I very much doubt Scott reads Jim’s blog, outside of Jim’s responses to Scott.

      What influence do you see of Jim on this post?

  45. gwern says:

    Some random comments:

    using statistics like “fail-safe N” to investigate the possibility of suppressed research.

    Nitpick: I think ‘fail-safe N’ should be avoided whenever possible. It assumes that publication bias does not exist, and so simply doesn’t do what one wants it to do. (See http://arxiv.org/abs/1010.2326 “A brief history of the Fail Safe Number in Applied Research”.)

    This scientist – let’s give his name, Robert Rosenthal – then investigated three hundred forty five different studies for evidence of the same phenomenon. He found effect sizes of anywhere from 0.15 to 1.7, depending on the type of experiment involved. Note that this could also be phrased as “between twice as strong and twenty times as strong as Bem’s psi effect”. Mysteriously, animal learning experiments displayed the highest effect size, supporting the folk belief that animals are hypersensitive to subtle emotional cues.

    I agree the Rosenthal results are interesting, but I think the Pygmalion effect is more likely to be an example of violating the commandments & statistical malpractice (Rosenthal also gave us the ‘fail-safe N’…) than subtle experimenter effects influencing the actual results; see Jussim & Harber 2005, “Teacher Expectations and Self-Fulfilling Prophecies: Knowns and Unknowns, Resolved and Unresolved Controversies” http://www.rci.rutgers.edu/~jussim/Teacher%20Expectations%20PSPR%202005.pdf

    But first of all, I’m pretty sure no one does double-blind studies with rats.

    Not really. They barely do randomized studies. That’s part of why animal studies suck so hard; see the list of studies in http://www.gwern.net/DNB%20FAQ#fn97

    • Anonymous says:

      I think ‘fail-safe N’ should be avoided whenever possible. It assumes that publication bias does not exist, and so simply doesn’t do what one wants it to do.

      Rosenthal’s fail-safe N should never be used, though not because it assumes that publication bias does not exist, but because it is based on the unrealistic assumption that the mean effect size in the unpublished studies is 0. On the contrary, if the true effect size is 0, then the mean effect size in the unpublished studies would be expected to be negative.

      In the Bem et al. meta-analysis, the authors calculated, in addition to Rosenthal’s fail-safe N, Orwin’s fail-safe N, which in principle can provide a more realistic estimate of the number of unpublished studies because it allows the investigator to set the assumed mean unpublished effect size to a more realistic (negative) value. But, bizarrely, Bem et al. set the value to .001, actually assuming that the unpublished studies support the psi hypothesis!
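
      For concreteness, here is Orwin’s formula with made-up numbers (not the actual Bem et al. values), showing how strongly the assumed mean effect of the unpublished studies drives the answer:

          # Orwin's fail-safe N: how many unseen studies with mean effect
          # d_new would pull the combined mean down to a trivial level d_crit?
          # From requiring (k*d_mean + n*d_new) / (k + n) = d_crit.
          def orwin_failsafe_n(k, d_mean, d_crit, d_new):
              return k * (d_mean - d_crit) / (d_crit - d_new)

          k, d_mean, d_crit = 30, 0.10, 0.01  # purely illustrative values

          # A realistic (negative) assumed mean for the missing studies:
          print(orwin_failsafe_n(k, d_mean, d_crit, d_new=-0.05))   # ~45
          # A slightly positive assumed mean, as criticized above, shrinks
          # the denominator and inflates the fail-safe N:
          print(orwin_failsafe_n(k, d_mean, d_crit, d_new=0.001))   # ~300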

      • gwern says:

        Rosenthal’s fail-safe N should never be used, but not because it assumes that publication bias does not exist, but because it is based on the unrealistic assumption that the mean effect size in the unpublished studies is 0. On the contrary, if the true effect size is 0, then the mean effect size in the unpublished studies would be expected to be negative.

        Yes, that’s what I mean: publication bias is a concern because it’s a bias, studies which are published are systematically different from the ones which are not, and the fail-safe N ignores this, treating the problem as if it were mere sampling error.

  46. Hi Scott,

    Regarding your biological concerns (from someone who is highly concerned with biology…):

    First, you say “exotic physics.” We, naturally, have to be careful when using the word “exotic” in this context. For example, the “physics” that almost everybody would point out as being “exotic”, namely Quantum Mechanics, is actually far more ubiquitous even than the Almighty Omnipresent Holy Lord Himself! (if He exists).

    Then you add that “the amount of clearly visible wiring and genes and brain areas”… …“is obvious to *everyone* ”… …“from anatomists (…) to molecular biologists (…) to JUST ANYBODY WHO LOOKS AT (…) eyes.” (emphasis added). And as a consequence, you expect similar *obvious* correlates. I think we have to remember that even almighty stuff isn’t always (surprisingly enough) obviously “perceptible.” Especially when some sort of canceling out is at play. For example, we all know that electromagnetic force is pretty much almighty (far mightier than gravity; and gravity is not exactly a light weight… – pun intended). Yet, were it not for lightning bolts, even fairly advanced human societies might have passed it by completely, without an inkling of perception of it. See, we have *obvious* physical apparatus for dealing with air (lungs; the inhaling process; exhaling; etc), with water or somewhat solid matter (mouth, teeth, stomach, etc), with visible light (eyes), with sound (ears), etc. But unlike electric fish, we do not have *obvious* biological machinery to deal with “lightning stuff.” Yet, not only is “electricity” itself immensely present in our biological machinery (i.e. even we, humans, take huge advantage of it), electromagnetism actually knits reality tight; and without it, matter would wander astray (even atoms would fall apart). What recent biology is telling us about biological uses of quantum phenomena is that it seems that we don’t have any *obvious* correlate to it in terms of biological machinery. Yet, the correlates that we do have are not only ubiquitous but essential for life. Enzymes work taking advantage of quantum tunneling. Photosynthesis, if my memory serves me well, takes advantage of quantum entanglement. And when I say “take advantage”, what I mean is: cannot do without! So, yes, it is speculative and even unlikely, but… it might just be that the correlates are there. It is just that we cannot see them yet.

    And next you add: “There’s also the question of how we could spend so much evolutionary effort exploiting weird physics to evolve a faculty that doesn’t even really work.”, and you remind us that psi won’t give us a head start greater than five percent over chance (as it seems; Ganzfeld), and then you mention its (apparent) absence in our ordinary lives, and the possible paradoxes of precognition (short and/or long term). Well, what might be happening (if psi exists…) is that we are not really looking at *how* psi works and at *what* psi works for. For example, electromagnetism does not exist for making lightning… Lightning bolts are just minor manifestations of the greater phenomenon of electromagnetism. They are almost “spin offs.” The basic and “true” function of electromagnetism in Nature is to knit reality together (Especially protons and electrons “directly”, as happens in atoms. And, oddly enough, as a consequence of this knitting, electromagnetism becomes pretty much… hidden!). So maybe the function of psi (i.e. the *main* function of it in Nature) is not to get humans synched in telepathy or forewarned in precognition or “physically” unencumbered in telekinesis. These might actually be lesser deities, when instead we should rather look for the Big Guy. 🙂

    Finally you say something like: “Weird Physics + Invisible Organ-less Non-adaptive-fitness-providing Mechanisms x Precognition x Telepathy x Telekinesis.” Parapsychology does have problems… Admittedly, even according to psi researchers who believe in psi, two chief problems are: 1) possible theories for the paranormal and 2) the role of the paranormal in Nature. We have been very clumsy in tackling these issues, IMHO. (Note that I am not a psi researcher. I am just considering these issues to be a concern of us all, no matter where we stand on them.) So we are pretty much in the dark trying to make sense out of something that almost everybody agrees… may not even exist! But since we (i.e. some of us) *are* trying to make sense out of it, one possibility (aside from the alternatives offered by Scott: poor studies, biased researchers, weird combinations of the previous alternatives, etc.) that comes to my mind when I look at the apparent anomaly in the Ganzfeld database (or when I look at Bem’s results now) is that this anomaly is a very “intentional” phenomenon. What I mean is: even if you can control electronic devices with the electricity from your neurons (something that we can now routinely accomplish), it takes a lot of practice and “informative feedback” for you to learn how to master it. Yet, when it comes to shifting the odds in the desired direction in Ganzfeld sessions, we seem to be naturals at it… I think the only biological conclusion that can be drawn from this is that, if psi exists, we all use it routinely. But… maybe we do not use it for the things most people (including almost all parapsychology researchers) believe it is used for.

    Anyway, just thoughts…

    Best Wishes,
    Julio Siqueira

    • So far as the Invisible Organ is concerned, if we don’t know how psi works, how likely would we be to recognize an organ for it?

      • Hi Nancy,

        That is really an impediment. And just to give an example that makes things far, far worse: we knew pretty well how enzymes work. We knew immensely well (some might say: astronomically well) how quantum mechanics works. Yet, no one, for decades, was able to unveil the fact that enzymes make use of quantum mechanics. Now, imagine this scenario under the conditions that you reminded us of (us not knowing how psi works). The expected result might well be: decades or even centuries of searching in the dark.

  47. Pingback: Lightning Round – 2014/05/07 | Free Northerner

  48. MoodyDoc says:

    The first thing that came to my mind when I read about the weird controversy over Wiseman & Schlitz’s results is that one scientist has psi powers and the other does not, while the subjects are all, on average, equally susceptible. Or it is really a kind of placebo that works telepathically. Like, when the one scientist stares at the subject while thinking “you can feel that I’m staring at you now!” this is subconsciously sensed. But if the other looks at the screen he acts more like an observer than an influencer. Still, observing the experiment from the outside by scientific means, one would not see the difference in the “input” but only the output. However, if so, then a live brain scan of the experimenters would surely be interesting to look at. And again another dataset to analyse more or less objectively…

  49. Pingback: Nothing About Potatoes | Things I found on the internet. Cannot guarantee 100% potato-free.

  50. Pingback: What we’re reading: Dealing with missing sequence data, SNP2GO, and the challenge of replication in bad results | The Molecular Ecologist

  51. Norm DeLisle says:

    Very nice! A couple of other observations. My father was a process development engineer at Dow Chemical. His design work in this area was to take a research result and test its commercial viability in a sizable plant process, a kind of enhanced replication. Researchers thought this was grunt work (the persistent attitude toward replication), and that all that was required was to make what they did in the laboratory bigger. But when you increase the size of a process by 5-7 orders of magnitude, a great many things become different. The lesson is that it isn’t always clear which changes in experimental conditions are important to the outcome, and the opinion of the researcher is a poor guide.
    The second observation is from an article I read from the late ’70s or early ’80s. It was a test of the hypothesis that niacin in large doses reduced the symptoms of schizophrenia. The design was double blind, and the people who evaluated the improvement didn’t know who was receiving the niacin. However, the waiting room for the people being evaluated would hold 4-5 people at a time. Naturally they talked, and because of the niacin flush, they quickly figured out who was on placebo. It was, incidentally, easy to figure out what the drug was, too. The conclusion was that one-third of the people on placebo broke the blind and went out and bought niacin. No one told the researchers, because there were incentives for not telling them the blind had been broken. Experimental conditions include everything, not just what the researcher thinks is critical to successful publication.
    Also, there is some evidence that placebos work even when people know they are placebos.

    • A well-known example of this phenomenon is the development of the Haber-Bosch nitrogen fixation process. Haber established the concept on a tabletop and Bosch overcame obstacles to do it large scale. They rightly share the credit, but it’s easy to slip focus back to the “original genius” perspective. Fascinating stories.

  52. Pingback: Das Versagen der Religionen - Seite 7

  53. Put Down Artist says:

    “That doesn’t tell you much until you take some other researchers who are studying a phenomenon you know doesn’t exist – but which they themselves believe in – and see how many of them get positive findings.”

    This statement made my jaw drop. The illogic is stunning. You have made an assumption, assumed this assumption to be true, and are then deriding the people who are researching the question with an open mind.

    I’m pretty sure this doesn’t need to be explained to you, but you don’t, in fact, know that these phenomena don’t exist until you have studied them scientifically. No ifs, ands or buts. Possible bias by researchers has to be taken into account when considering their results, and when they themselves formulate their experiments – in any scientific research.

    However, it is ridiculous to assert that, because you personally don’t believe in something, anyone trying to determine whether there is a way to scientifically measure and validate the alleged phenomena is automatically wrong.

    This is a very worrying thought pattern I see from ‘skeptics’ all the time. It is not skepticism at all, it is a kneejerk response to things that threaten their personal world view and belief system. By this definition 99% of the people on the planet are ‘skeptics’, and the only genuine skeptics are those prepared to challenge their own world views.

    I’m concerned that someone could actually take that statement seriously, so allow me to assert that to actually prove that a group of researchers’ science is wrong, you have to go into their methodology and conclusions and find errors. To take issue with their conclusions and then assume that because you don’t like them their methodology must be flawed is the least scientific approach imaginable.

    Sheesh.

  54. Love your post, and v pleased that you approve of our KPU trial registry 🙂
    Just FYI, the other parapsych pre-registry that you refer to (the one Richard Wiseman and I set up for Bem replications) dates back to November 2010 and is no longer active.

  55. Pingback: Science smorgasbord 2 | Deadline island

  56. Phil Goetz says:

    The results? Schlitz’s trials found strong evidence of psychic powers, Wiseman’s trials found no evidence whatsoever.

    Take a second to reflect on how this makes no sense.

    It makes perfect sense. Schlitz has psychic powers. Wiseman doesn’t. They need to redo the experiment, keeping Schlitz as the starer in both groups.

  57. Pingback: other mind meditation | Meditation Stuff (@meditationstuff)

  58. Stephanie says:

    Wow, what an amazing post. I love your blog, it’s awesome. I just had my heart broken a little though. This is why I’m a physicist. Physics is still hard as hell and not as clear as many people think, but it’s easier to get some level of confidence in your results than in fields involving biological organisms. As long as you actually care about reality. Some physicists are just mathematicians, i.e. string theorists.

    Did you ever read The Golem? We read it in a philosophy of science course I took in the education department. (http://www.amazon.com/The-Golem-Should-Science-Classics/dp/1107604656). It focuses a bit too much on the uncertainty side, but I think too many people have gone from faith in an invisible sky god to faith in “Science”. I have a post on my blog with an essay I wrote for fun in grad school comparing incentives in science vs incentives in free market capitalism. I always found faith in the invisible hand of free market capitalism to cure all human ills to be a bit too much like faith in the invisible sky god, and faith in “science” is right up there. I have faith that, in the long run, science will probably get closer to reflecting how the universe actually works, but not in any particular current paradigm. Skepticism is a virtue in science. Except in climate change, which I consider more like a religion, which I also have a post on.

  59. Kevin Keough says:

    Check out Rupert Sheldrake re much of this

  60. Solo Atkinson says:

    “…both authors suggest maybe their co-author hacked into the computer and altered the results.”

    Actually, it was more collegial than that. Together, they suggest that one of them may have hacked the results.