Saturday, January 08, 2011

New Evidence for ESP?

The Journal of Personality and Social Psychology, a respected academic journal published by the American Psychological Association, will soon release an article by Cornell psychologist Daryl Bem that supposedly demonstrates the existence of “extrasensory perception,” or ESP. A preprint of the paper is available at http://dbem.ws/FeelingFuture.pdf.

ESP is a term used in popular culture for unexplained psychic effects. It is used exclusively, for example, in the New York Times article of Jan 5, 2011, reporting on Bem’s paper (http://www.nytimes.com/2011/01/06/science/06esp.html?src=me&ref=general). Academics refer to such effects as “paranormal,” “parapsychological,” or “psi” phenomena. These psi phenomena allegedly include a potpourri of unexplained effects, such as mental telepathy, remote viewing, clairvoyance, telekinesis, precognition, and communication with the dead, to name just a few varieties. Bem’s paper focuses on precognition, which is unexplained knowledge of the future, and premonition, which is the same thing only felt emotionally instead of known intellectually.

The paper reports nine experiments, only four in any detail, that were conducted over a decade, with a thousand people tested. In a typical experiment, participants guess whether a picture (called the “stimulus”) will appear on the left or right side of a computer screen. If the prediction is correct, then either it was a lucky guess or the person had a premonition of where the stimulus was going to be. Random guessing would produce a 50% correct rate, but the guesses were correct 53% of the time, a percentage greater than chance. That doesn’t seem like much of a difference, but since the test was run many times on each person, the finding is statistically rare enough that it is probably meaningful. Therefore, according to Bem, a slight but scientifically proven result of precognition, or premonition of the future, was demonstrated.
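To see why a 53% hit rate can be statistically rare even though it looks small, here is a minimal Python sketch that computes the exact chance of doing at least that well by guessing alone. The trial counts used (100, 1,000, and 10,000) are illustrative assumptions, not figures taken from Bem's paper, and the function name is made up for this example.

```python
from math import comb


def upper_tail_p(successes: int, trials: int) -> float:
    """Chance of at least `successes` correct guesses in `trials`
    independent 50/50 guesses (exact binomial upper tail)."""
    # Count the outcomes with `successes` or more hits using exact
    # integers, then divide by the 2**trials equally likely outcomes.
    favourable = sum(comb(trials, k) for k in range(successes, trials + 1))
    return favourable / 2**trials


if __name__ == "__main__":
    # Hypothetical trial counts, each with a 53% hit rate.
    for hits, trials in [(53, 100), (530, 1000), (5300, 10000)]:
        print(trials, "trials:", upper_tail_p(hits, trials))
```

The same 53% is unremarkable over a hundred guesses and very unlikely over ten thousand, which is the only sense in which a 3-point difference can be called statistically rare.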

Bem notes in his paper that “Psi is a controversial subject, and most academic psychologists do not believe that psi phenomena are likely to exist.” That is correct, and I am one of those psychologists. I do not believe any psi phenomena have ever been demonstrated scientifically, nor indeed that they exist at all. How, then, do I explain scientific findings such as Bem’s (and there have been many such supposed demonstrations of psi phenomena over the years)? There are four obstacles to acceptance that any such scientific demonstration must overcome:

1. Methodology. The experiment must be designed and conducted in such a way that the best, most reasonable conclusion is that psi phenomena have been demonstrated, rather than some other explanation, such as pure chance, lurking (uncontrolled) variables, equipment or procedural error, biased sampling, unintended clues being given to participants, inadequate experimental controls, or other kinds of unintended bias or error (deliberate fraud is not considered, as that is rare and easy to detect).

2. Statistical. The experimental data must be conceptualized, analyzed and reported in a simple, correct, and non-controversial way, so that the best, most reasonable conclusion is that psi phenomena have been demonstrated. Even if the experimental procedure was sound, the statistical handling of the data can introduce biases that lead to invalid conclusions, such as when the data are manipulated inappropriately (e.g., leaving out some data from the analysis), or conceptualized strangely (such as counting certain results in one way, other results in another), or analyzed with controversial or questionable statistical techniques, or interpreting the outcome in obscure or inappropriate ways.

3. Theoretical. The findings must be coherent with an existing body of scientific data, or if they are not, some revision in understanding of the existing data must be specified which accommodates the new, anomalous finding. There are two reasons for this requirement.

One is that according to the scientific method (a consensus model of scientific reasoning), the hypothesis that an experiment tests is drawn from the existing body of scientific data. A scientist does not just wake up one morning with a hypothesis that says, “I suspect that giraffes would float in water as well as raspberry marshmallows.” That is not how science is done. Instead, the scientist finds areas in the existing body of scientific knowledge where there are questions, errors, gaps, unexplained connections, or incomplete understanding. A hypothesis is then generated that could extend the existing knowledge or make it more understandable or more internally consistent.

The second reason for requiring that scientific findings must mesh with existing knowledge is that science is a cumulative exercise in knowledge production. Even if some arbitrary and idiosyncratic hypothesis were experimentally tested and confirmed, the result would be uninterpretable because it would not connect to existing knowledge, would not further general understanding, and would not even contradict what is already known. There would be no context for making sense of the experimental result, making it essentially meaningless, no matter what it purports to demonstrate.

Historically, strange things have sometimes been found in nature that could not be explained until much later, such as lightning or x-rays. But technically, those discoveries were anomalous observations, not scientific findings, until some explanation was hypothesized that could be tested under a scientific hypothesis.

4. Philosophical. A scientific finding that meets all of the foregoing requirements still must be interpreted in a scientific way. For example, a finding that concludes, “All human beings are therefore merely ideas in the mind of God,” cannot be accepted without a great deal of further explanation. The interpretation of the finding must conform to principles of scientific reasoning and evidence. This example fails on both counts, because there is no scientific evidence of God, and to characterize humans as ideas rather than as biological objects is not consistent with scientific reasoning.

Alternatively, the interpretation of the finding can go too far in the other direction, being so scientifically overspecified that the result admits of no generalization, an error of “external validity.” An example would be an experiment that claims to study “violence in children” but defines violence as a high frequency of button presses on laboratory equipment. Since that does not describe what we normally understand as violent behavior, even if the study meets all other criteria, we are unable to say anything about the result beyond the specific experimental procedure.

Another common error is that a study defines its variables in terms of laboratory procedures, but interprets its results in different terms, an error of “internal validity.” In the example above, if college students were used as participants, it is not valid to conclude anything about violence in children.

Bem’s studies that purport to demonstrate psi phenomena fail to overcome any of the obstacles described, and therefore I remain unconvinced of the existence of so-called psi-phenomena.

To prove this definitively, I would have to study the experimental protocol, data, and statistics to make detailed criticisms, and that would require either that I have access to Bem’s laboratory notebooks, which is not going to happen, or that I repeat his experiments, step by step, in order to understand what he did and what kind of data he obtained. That is also not going to happen. So, like any other ordinary consumer of scientific information, I must base my acceptance or criticism of the experiments only on the scant information provided. Here are some criticisms then, within that constraint.

Summary of Bem’s Experiment 1

1. Methodological factors. In Experiment 1, a featured experiment supposedly demonstrating precognition, participants had to guess which of several pictures would be randomly shown. I’ll start by summarizing the experimental procedure.

One hundred Cornell undergraduates were self-selected (volunteer) participants, half men, half women and were paid for their participation. They all knew it was an experiment in ESP.

A picture of starry skies was shown on the screen for three minutes, while new-age music played. Then that picture was replaced (and presumably the music terminated, although that is not stated), with two pictures of curtains, presumably side-by-side (although that is not stated). When a participant clicked on one of the pictures of a curtain, it was replaced with another picture, either a picture of a wall, or a picture of something other than a wall.

The content of the “other than a wall” pictures is not described, except to say that 12 of the 36 non-wall pictures showed humans (presumably – this is not specified) engaging in “sexually explicit acts” (not further described), while another 12 of the non-wall pictures were “negative” in emotionality (not further described), and the final 12 non-wall pictures were “neutral” (but not described).

All these pictures had previously been rated (although when is not stated) by other people not in this experiment as being reliably “arousing” for males and females (although “arousability” is not defined), or as being reliably “emotional.” There is no information about whether any arousing pictures were also emotional, and it is hard to imagine that they were not. There is no definition of what constituted a “neutral” photograph, and there is no description of the arousability or emotionality of the wall picture or of the curtain pictures.

Part way through the experiment, some or all (not specified) of the “arousing” pictures were replaced by more intense (not otherwise described) internet pornography pictures, which were not reported to be scientifically rated for arousability and emotionality, so in the end, the nature of these pictorial stimuli is essentially unknown. (We assume that among the 36 non-wall photographs, none was, in fact, of a different sort of wall, although this is not actually stated.)

The non-wall pictures were selected at random from the group of 36, with randomness defined by a software algorithm. Whether the wall or non-wall picture was placed on the left or the right of the screen was also randomized by the computer.

Each participant’s task was to click on one of the two pictures of curtains to indicate which one they thought would be replaced by a non-wall photograph. They were told that some of the pictures were sexually explicit and allowed to quit the experiment if that was objectionable. No information is given on how many participants quit. After the participant’s choice was made, the curtain picture was replaced by another picture, either of a wall or a picture of non-wall content.

Errors of Internal Validity

That summarizes the experimental protocol. According to Bem, this methodology made “the procedure a test of detecting a future event, that is, a test of precognition” (p. 9). However, that is not how the results were recorded. You would think that the scientist would simply record whether or not the participant had correctly predicted which side of the screen the non-wall picture would appear on (since that was the instruction given to the participant, and that was the hypothesis to be tested). Instead some other, strange measure was recorded, namely, the number of correct predictions of which side of the screen the “erotic” (meaning sexually explicit) pictures occurred, even though that was not the hypothesis being tested. This odd recording of the results constitutes an error of internal validity.

The hypothesis that college students will be better at predicting the location of a sexually explicit picture is unconnected to the introductory literature review, which referred only to a previous body of findings that asked for straightforward predictions of visual content, with no special reference to sexually explicit material. This new (implicit) hypothesis is, then, essentially like the “giraffe and marshmallow” hypothesis, arising “out of the blue” rather than being logically derived from existing knowledge. This is another methodological error. If there is, in fact, a previous body of knowledge about predicting the locations of sexually explicit photographs, then the error is one of scientific reporting, since the literature review was obviously grossly incomplete.

One other, rather minor error, is the experimenter’s referring to the participants’ prediction of the location of a non-wall photograph as a “response” to that photograph. But this is a semantic distortion, since the participant’s choice is made before the photograph is shown. Ordinary, common-sense language would call that choice a “prediction” not a “response.” For the experimenter to call it a “response,” presupposes the validity of his belief that the participants are seeing into the future, but until that is proven by the experimental results, it is scientifically inappropriate to use the language in a non-standard way without justification.

Statistical Errors
Next, Bem reports that participants correctly predicted the position of the sexually explicit pictures significantly more frequently than the 50% rate expected by chance, and in fact were correct 53.1% of the time. But this is an incorrect analysis. To be counted as correct, a prediction would have to correctly say on which side of the screen a non-wall photograph would appear (one chance out of two, or a 50% chance rate) AND, if that guess were correct, they would also have to predict that it was a sexually explicit photograph (12 chances out of 36, or 33%) for an overall probability of 0.50 x 0.33 = 0.165, which means that one would expect a person to guess correctly fewer than 17 times out of a hundred.
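The product above can be checked with a short Python sketch. It follows the reasoning in this post, treating a correct side guess and a sexually explicit picture as independent events, and it is not a reconstruction of Bem's own scoring. All names in it are hypothetical. With the exact fraction 12/36 = 1/3 the product is 1/6, or about 0.167, rather than the rounded 0.165; the conclusion of fewer than 17 in a hundred is unchanged.

```python
import random
from fractions import Fraction

# Probabilities as stated in the post (independence is an assumption).
P_SIDE = Fraction(1, 2)        # correct left/right guess
P_EXPLICIT = Fraction(12, 36)  # picture is one of the 12 explicit ones
P_JOINT = P_SIDE * P_EXPLICIT  # both at once: 1/6


def simulate_joint_rate(n_trials: int = 100_000, seed: int = 1) -> float:
    """Fraction of simulated trials with a correct side guess AND an
    explicit picture, under pure random guessing."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        guess_left = rng.random() < 0.5
        target_left = rng.random() < 0.5
        picture = rng.randrange(36)  # indices 0..11 stand for explicit pictures
        if guess_left == target_left and picture < 12:
            hits += 1
    return hits / n_trials


if __name__ == "__main__":
    print("exact:", P_JOINT, "=", float(P_JOINT))
    print("simulated:", simulate_joint_rate())
```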

Did that happen? No information is reported on how many times the participants DID actually predict the location of sexually explicit photographs. It was not 53.1%. That is the number you get when you ignore, or leave out of the calculation, all the wrong predictions of the non-wall photograph. But that is an illegitimate way to count the results, unless there is a very good reason, and none is given.

Still, can we at least say that the participants correctly predicted the location of ANY non-wall photograph better than chance (53.1% vs. 50%)? No we can’t, because that information is not reported either. Instead, what is reported, is that participants predicted the location of only the sexually explicit photographs at 53.1%. But that leaves out all the results for the non-sexual predictions, which is not a legitimate way to count the results. So in the end, the results that bear on the experiment’s stated hypothesis are not reported at all.

This kind of anomalous counting of the results constitutes a statistical error and makes the experimental findings uninterpretable.

The same kind of anomalous, illegitimate, and uninterpretable counting of results is given for non-sexually-explicit pictures, emotional pictures, and neutral pictures, and even for pictures that were “romantic but non-erotic pictures,” a category that was never defined in the description of the pictures (let alone in any experimental hypotheses).

The experiment also reports that there were no significant differences in response findings between males and females. That is a legitimate “control variable” to be reporting, although the experimental hypothesis being tested has nothing to do with gender. So that is not an error, as much as an irrelevance.

Then the experiment reports on a history of findings in other experiments that turns up a small correlation between the ability to predict the occurrence of visual materials at a rate greater than chance, and the participant’s score on an extraversion test, with extraverts supposedly being better at making such predictions than introverts. There are two problems with this so-called result.

One is that it is based on a statistical technique called meta-analysis, in which the main findings of individual experiments are treated as if they were individual response data points observed in individual participants. While this statistical technique is now widely used in the medical literature, it is by no means without controversy when applied to psychological experiments, and I reject it as a valid statistical technique for psychology.

The main reason for my rejection is that the technique generally does not take into account the quality of the underlying experiments, or if it does, does so inadequately. For example, if some future meta-analysis includes this experiment, that will introduce significant undocumented error into the meta-analysis because this experiment does not actually report any valid results, despite its claim to the contrary.

The second problem with this so-called reported result between predictive success and personality is that it is irrelevant to any scientific hypothesis, implicit or explicit, that was supposed to be tested by this experiment.

Errors of Interpretation:
The experimental report goes on at great length to determine what “kind” of psi phenomenon had been demonstrated by the test results (which were never properly reported). Was it simple clairvoyance or was it a subtle form of psychokinesis? Or was it actually pure chance? (Admirably, the report does consider that possibility.)

But a simpler explanation is hinted at by the experimental procedure itself. After the participant made his or her prediction of where the non-wall photograph would appear, the curtain picture chosen was replaced with either the wall, or a non-wall photograph. This essentially gave the participant feedback on the correctness or non-correctness of the prediction. But why was that necessary or desirable?

The experimenter knew immediately upon the participant’s click whether the prediction was correct or not. That could be scored right on the spot by the computer. Why was it important to give the participant “feedback?” The experimental hypothesis was about ability to see into the future, precognition. Why is feedback necessary to do that? Was the hypothesis really that ability to predict the future can be taught by a computer and learned with practice? There is no theoretical or practical reason to believe so, and the experimental report does not suggest it.

The only reason I can think of to give the participants feedback on the correctness of their predictions is so that they might be able to learn from their mistakes and improve their performance. That is a standard learning procedure going back over a hundred and fifty years in experimental psychology and thousands of years in human experience. The experiment thus introduced a spurious learning paradigm into a procedure that was supposed to test only ability to predict the future. That is a serious error of internal validity that renders the experiment uninterpretable (if it was not already).

What would the participants be learning, with this embedded learning procedure? I am unable to say without more detail about the experiment. Could they be learning (even if only implicitly) to detect a non-random pattern in the order of presentation of the materials? A non-random pattern could have emerged. Either the random number generator could have been imperfect (since there is, theoretically, no such thing as a perfect random number generator), or, within the pseudo-random sequence of events, an identifiable non-random pattern could have emerged, just as when one flips a fair coin “heads” 7 or 8 times in a row, just by chance. These things happen. It wouldn’t take much non-randomness to produce a mere 3% deviation from chance expectations.
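The remark about streaks can be illustrated with a small simulation. This is only a sketch: the sequence length of 100 flips, the number of simulated sequences, and the function names are assumptions chosen for illustration, not details of Bem's procedure.

```python
import random


def longest_run(flips) -> int:
    """Length of the longest streak of identical consecutive outcomes."""
    best = current = 0
    previous = None
    for flip in flips:
        # Extend the streak if this flip matches the last one, else restart.
        current = current + 1 if flip == previous else 1
        best = max(best, current)
        previous = flip
    return best


def share_with_long_run(n_flips: int, run_length: int,
                        n_sequences: int = 2000, seed: int = 0) -> float:
    """Fraction of simulated fair-coin sequences that contain a streak of
    at least `run_length` identical outcomes (heads or tails)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sequences):
        flips = [rng.random() < 0.5 for _ in range(n_flips)]
        if longest_run(flips) >= run_length:
            hits += 1
    return hits / n_sequences


if __name__ == "__main__":
    # Hypothetical setting: how often do 100 fair flips show a streak of 7+?
    print(share_with_long_run(n_flips=100, run_length=7))
```

Under these assumptions, a streak of seven or more turns up in a large share of perfectly fair sequences, so an apparent pattern is not by itself evidence of a faulty generator.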

Or, more likely in my opinion, the participants could have been learning something else, some other clue that was unintentionally left in the procedure by the experimenters. I cannot say what that might be. For example, it would be interesting to know if an experimenter was in the room while the participant performed. There is no reason why one should have been, but if one was, there are all kinds of opportunities for subtle, unintended clues (or “experimenter effects”) to be transmitted to the participants.

Bem reports that he re-ran the experiment but using randomized, simulated computer inputs for the “predictions,” with no human participants involved. Under those conditions, no psi effects were detected. I am not surprised, but that result furthers my skepticism about the human-based findings: that if there really were any legitimate ones (which I doubt), they were due entirely to unintended experimenter effects or performance biases.

The only way to satisfy my skepticism on this point would be to re-run the experiment, with humans, but omitting the spurious learning component from the procedure, and isolating the participant completely from any contact with the experimenter or any other participant. I would be extremely surprised if any so-called “psi effect” were reported under those conditions.

Theoretical and Philosophical Errors:
Aside from the methodological and statistical problems with this study, there are additional theoretical and philosophical problems. First, I must emphasize again that no psi phenomena were demonstrated by any of these experiments, as reported. But even if there were such a thing as psi phenomena, for example, ability to predict the future at a rate better than chance, what sense would it make?

There is no known mechanism, either biological, physical, or psychological, by which that would be possible. Human beings are simply not able to predict the future very well. Would that it were otherwise! Bem does some hand-waving around quantum indeterminacy and the earth’s magnetic field to suggest possible explanations of psi phenomena (if they existed), but that verbiage constitutes, most generously, only loose metaphor, nothing close to an explanation.

Could the explanation of psi effects, if there were any, just turn out to be something bizarre, something we have never thought of yet, not related to anything familiar, not like anything ever reported in the accepted scientific literature? Well, yes, that is possible in principle. I’m sure Socrates himself would not have been able to understand a butane lighter or a sheet of plastic food wrap, let alone some of our more complex technological marvels. So it is not a denial of the possibility of psi phenomena to assert that there is presently no conceivable explanation of them, as they have been described. But it is utterly idle to speculate on explanations until the phenomenon to be explained has been demonstrated, and I am not convinced it ever has been.

Conclusion:
In his forthcoming paper, Bem describes three additional experiments, similar to the first one, in some detail, and refers to five others, not fully described. However, as is always the case when I take the trouble to read such experimental reports, after analysis of the first one (an analysis that was by no means exhaustive), I simply have no energy to go on to the rest of them. The quality of the first one is so poor that there is little promise that the others will be much better. So I give up at this point and return to my default belief, that has not been challenged by Bem or anybody else, that no psi phenomenon has ever been scientifically demonstrated. Show me a proper demonstration and I’ll change my mind.