Linguistic Experiments with WebExp2

[Overview Local setup WE2 Support Design Issues Creating WE2 Experiments Advertising Results & Analysis]

General Design Issues

I am by no means an expert in experimental design, but I have tried to collect some basic thoughts on the topic on this page. As sources I used Kimmel (1970), Cliff (1996), personal communications with Lera Boroditsky and Ted Gibson, and my own common sense (yes, p.c.). None of them is to be blamed for whatever you find misleading or even disconcerting on this page.


The basic idea of experiments

The basic idea of experiments is that you come up with a hypothesis (or take someone else's hypothesis) and you want to test whether this hypothesis holds. Usually you will consider a hypothesis stating that one difference (variation along some dimension) stands in some kind of causal relationship to another difference (variation along another dimension). For the purpose of testing, the former is called your independent variable (or variables, if you're interested in the effect of variation along several dimensions) and the latter is called the dependent variable. NB: Never forget that these variables usually are operationalizations of what you are really interested in! For example, you may have a hypothesis that certain things make a language harder to process. In your experiment you may operationalize "harder" in terms of reading times in a self-paced reading study, but never forget that the reading times you measure are themselves merely a behavioral correlate of some cognitive mechanism, and it is that mechanism that you are making hypotheses about.

It is important to understand that at least the most interesting hypotheses state causal relationships. Any statistical test we conduct to test such a hypothesis, however, will only be able to detect the presence of correlations, which by definition are non-directional. Another thing to keep in mind is that we will never be able to actually prove a hypothesis. The only thing that can be done is to reject competing hypotheses. This is done by showing that those hypotheses make the wrong predictions about the relation between your independent variable(s) and the dependent variable. There are basically three ways in which the predictions of competing hypotheses can be rejected (here I ignore cases where your hypothesis is that "nothing is going on", i.e. where you predict a null effect; accepting a null effect, or rather rejecting the hypothesis of inequality, is possible under certain assumptions, but slightly more complicated than the cases considered here; I refer you to Shravan Vasishth's homepage, which contains instructions on how to conduct such tests).

A competing hypothesis may predict the opposite correlation from yours. For example, one hypothesis may predict an increase in acceptability (your dependent variable) given a certain manipulation (of your independent variable(s)), but your hypothesis predicts a decrease in acceptability. If you find a significant decrease in acceptability in your experiment, you can reject the competing hypothesis.

Another competing hypothesis that can be rejected in such a case is the null hypothesis, i.e. the hypothesis that "nothing is going on" with regard to the relation between the dependent and the independent variables. The null hypothesis always exists. Even if no other researcher has suggested another hypothesis for your phenomenon, you can test your hypothesis by rejecting the null hypothesis (of course, this doesn't prove your hypothesis; it merely puts it out there until someone rejects it).

Finally, and this is where things get more complicated, your hypothesis may 'outperform' competing hypotheses quantitatively. I won't say much about this case, since it is still a rather uncommon test in research on the psychology of language (but I think it will become more important). The general idea is that one hypothesis predicts the observed relation between dependent and independent variables better, in the sense that it predicts more of the observed variation. How this can be determined is described in the section on Comparing Models: Goodness-of-fit.

The remainder of this page briefly deals with the following topics: basic terminology, design guidelines, web experimentation, and ethical concerns/IRB approval.

Finally, I highly recommend reading some papers by Reips, who has laid out the trade-offs of internet-based experimentation. Reips (2002), "Standards for Internet-based Experimenting", Experimental Psychology, 49(4), 243-256, summarizes drawbacks and advantages of online experiments, as well as some ways around common problems.


Basic terminology

The typical psycholinguistic experiment consists of a number of conditions, which result from, e.g., a factorial design. For example, there may be two factors, one with two levels (i.e. the different values a factor can take) and one with three levels. In a full-factorial design, each level of each factor is crossed with all levels of all other factors. So in the example just mentioned, there would be 2 x 3 = 6 conditions.
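The crossing of factor levels into conditions can be sketched in a few lines of Python; the factor names and levels below are made up for illustration.

```python
# Sketch: enumerating the conditions of a full-factorial design.
# "factor_a" and "factor_b" are hypothetical names for illustration.
from itertools import product

factors = {
    "factor_a": ["level1", "level2"],            # 2 levels
    "factor_b": ["level1", "level2", "level3"],  # 3 levels
}

# Each condition is one combination of levels, one level per factor.
conditions = list(product(*factors.values()))
print(len(conditions))  # 2 x 3 = 6 conditions
```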

Since the goal of experiments is to generalize from the observed sample to the overall population, we test the effect of many different instantiations of the conditions (items) on many different people (subjects/participants). For language experiments, an item can be thought of as a lexicalization of the conditions. That is, if an experiment has n conditions, an item will consist of n different stimuli (e.g. sentences) which preferably differ only with regard to the factors that you intend to manipulate. Here is an example.

Let's say we are interested in it-clefts ("It is NP-X, who NP-Y ..."). More precisely, we want to know how the acceptability of it-clefts is influenced by (a) the form of the clefted NP-X and (b) the form of the subject NP-Y. We come up with a 2 (NP-X is a pronoun vs. a common noun phrase) x 2 (NP-Y is a pronoun vs. a common noun phrase) design, resulting in 4 conditions. An example item in its four conditions is given below. Note that the four stimuli are identical except for the form of NP-X and NP-Y. That is what is meant by saying that the stimuli of an item should "differ only with regard to the factors that you intend to manipulate".

    [PRO, PRO] It is you, who I dislike.
    [PRO, CNP] It is you, who men dislike.
    [CNP, PRO] It is cowards, who I dislike.
    [CNP, CNP] It is cowards, who men dislike.
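Since the four stimuli of an item differ only in the manipulated NPs, an item can be generated mechanically from a sentence frame; here is a small sketch of that idea for the it-cleft example (the variable names are my own, not part of WebExp2).

```python
# Sketch: one "item" realized in all four conditions of the 2x2 design.
# NP-X (the clefted NP) and NP-Y (the subject NP) are each either a
# pronoun (PRO) or a common noun phrase (CNP).
from itertools import product

np_x = {"PRO": "you", "CNP": "cowards"}  # clefted NP
np_y = {"PRO": "I", "CNP": "men"}        # subject NP

item = {
    (x, y): f"It is {np_x[x]}, who {np_y[y]} dislike."
    for x, y in product(np_x, np_y)
}

# The four stimuli differ only in the two manipulated NPs.
print(item[("PRO", "CNP")])  # It is you, who men dislike.
```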

Design guidelines

Below, I discuss the following guidelines for good experimental design:

  • What's a good hypothesis?
  • Keep it simple
  • (If necessary) bias against your own hypothesis
  • Random sampling
  • Balancing
  • Counterbalancing (Latin-square design)

What is a good hypothesis?
A good hypothesis makes different predictions from competing hypotheses. These differences must be testable. Sometimes you may prefer one theory over another on conceptual grounds while available methodology cannot distinguish between the competing hypotheses, but usually the most interesting hypotheses are testable given current methodology, or with a methodology that you can develop given current technology. Most of the psycholinguistic research currently being conducted investigates qualitative hypotheses, i.e. theories that state that a certain factor matters, or that it matters before or after another factor, more or less than another factor, etc. Almost no research attempts to actually quantify those differences or to construct complete models that incorporate several factors in order to predict human behavior (such as acceptability judgments). Arguably, such quantitative hypotheses/theories are to be preferred over merely qualitative ones, and there are some attempts to develop such theories (sometimes this research runs under the heading of "towards a zero-parameter model of X"). So, when you develop a hypothesis, be clear on what you want to predict. Is it a qualitative difference (even then it matters in which direction it will go), or are you making a more detailed claim about the quantitative relationship between variables?


Keep it simple and focus on your hypothesis
Resist the temptation to pack all the ideas you have about a phenomenon into one experiment. Too many levels (of factors) will require too many stimuli per subject (see Items per condition). So, don't include all the manipulations you can think of in the first experiment. Try to establish first that there is a good reason to assume that the effect you hypothesized actually exists.

While it is good to control for possible confounds, there is a trade-off in terms of how much energy you put into the first experiment on a new topic. You don't have to address all possible objections to the interpretation you intend to make (given certain results) in the first experiment. As a rule of thumb, I suggest controlling or balancing for two types of things:

First of all, you should make sure that your hypothesis predicts a different outcome of your experiment than the competing hypotheses. Your experiment makes no sense if it doesn't distinguish between your hypothesis and competing hypotheses.

Second, you should control or balance differences that are known to affect the type of dependent variable you are measuring (e.g. reaction times, acceptability judgments, etc.). For example, it is well known that plausibility, dependency length, and lexical bias can affect reading times. Similarly, complexity of the stimulus and probably plausibility affect people's judgments of how natural a sentence is. So, if none of these things is what you're interested in, be sure to control or balance them across items.

You can run additional experiments. Actually, most journals require several experiments per publication (at least 2, often more). So, running follow-up experiments isn't only better in terms of getting clean results; it'll also make publishing much easier ;-).


Bias against your hypothesis
In some cases, it isn't possible to balance all the factors you think may matter. Be sure not to confound your design in your favor. If you have a choice, always bias the design against your own hypothesis. This is the only way to arrive at some certainty that what you find actually supports your hypothesis.


Random sampling
Many books have been written about random sampling, external and internal validity, etc. The gist of it is that participants in your experiments should be sampled randomly from the population you are interested in. Often this isn't the case. Most psychology experiments are conducted with 16-22 year old college students as participants. Be aware that any sampling method may introduce a bias (for example, you may sample subjects over the WWW, as we will do for our WE2 experiments, but this way you bias your sample to contain only internet users). The important question is whether you have reason to believe that the behavior you will observe in your sample cannot be generalized to the population. Never forget that it is the population that you are interested in, not the summed individual differences of the people in your sample. This also has consequences for the statistical analysis of your experiment, because we should treat randomly sampled participants as random factors, rather than fixed factors. I will get back to this in the section on Random vs. fixed factors.

NB: Random sampling becomes especially important whenever subjects are grouped together in order to test some between-subject manipulation. For example, say you are interested in the effect of instructions on acceptability judgments. You form two groups of subjects. The first group is asked to rate "how natural" each sentence is. The second group is asked "how correct" each sentence sounds. In such an experiment, it is crucial that you can be certain that subjects are assigned to each group randomly. Imagine that for some reason the first group consists mostly of editors of Webster's and the second group consists of members of the "Society for the decay of the English language". You wouldn't want to attribute differences between the results of the two groups to the difference in the instructions they got.
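Random assignment to between-subject groups can be as simple as shuffling the subject list and splitting it; here is a minimal sketch, with made-up subject IDs and a fixed seed only for reproducibility.

```python
# Sketch: randomly assigning subjects to two instruction groups, so that
# group membership cannot systematically correlate with subject properties.
import random

subjects = [f"subj{i:02d}" for i in range(1, 21)]  # hypothetical subject IDs

rng = random.Random(42)  # fixed seed so the assignment is reproducible
shuffled = subjects[:]
rng.shuffle(shuffled)

# Split the shuffled list in half: one group per instruction wording.
half = len(shuffled) // 2
groups = {
    "how natural": shuffled[:half],
    "how correct": shuffled[half:],
}
```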

For language experiments, we don't only want to sample randomly across subjects. We also want to generalize beyond the sample of language that we used. This means we should randomly sample our items (sentences) from the population (e.g. English). This point isn't trivial. When we sit down and think about items, e.g. verbs, for our experiment, it will usually be the high-frequency words that come to mind first. Depending on your research questions, this can be highly problematic. Use databases (a lexicon, or e.g. the MRC psycholinguistic database) to overcome this bias (e.g. you could use every third verb that meets your criterion in alphabetic order). As with subjects, random sampling of items matters in two ways: (a) you want to be able to generalize beyond your sample (external validity); (b) you want to make sure that differences between the conditions in your experiment are not due to violations of random sampling (which would lead to confounding; internal validity).
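The "every third verb" strategy is a form of systematic sampling, which can be sketched as a list slice; the verb list below is a made-up stand-in for a query against a lexical database.

```python
# Sketch: systematic sampling of items to avoid the bias toward
# high-frequency words that come to mind first. The candidate list is
# hypothetical; in practice it would come from a lexicon or database.
candidate_verbs = sorted([
    "arrive", "bloom", "cough", "dawdle", "elapse", "fidget",
    "giggle", "hesitate", "itch", "jog", "kneel", "linger",
])  # verbs meeting some criterion, in alphabetic order

# Take every third verb, as suggested above.
sampled = candidate_verbs[::3]
```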

The latter, internal validity, is often violated. Suppose you hypothesize that verbs with more participants are more complex to process and that this leads to lower acceptability scores for sentences with ditransitive verbs than for sentences with intransitive verbs. Now let's say that the intransitive verbs you sample are more frequent than the ditransitive verbs you sample. In that case, any difference you observe could be due to frequency, and you cannot conclude anything from the results. Note further that random sampling may not always solve this problem. In the current example, if intransitive verbs are on average more frequent than ditransitive verbs, random sampling would lead to the same confound. In such cases, you may consider balancing your data (for frequency).


Balancing
Sometimes you have to balance your data against possible confounds. The basic idea of balancing is that you try to reduce the differences between your conditions to the difference that you intend to manipulate. I have already given some examples above. There is reason to believe that frequency, plausibility, and complexity all influence acceptability judgments. If those are not what you're interested in, it is important that you balance them across your items. What does that mean? It means that the mean value (averaged across items) of a balanced variable is the same (as in: statistically not different) across conditions. You can check this with the help of norming studies, which test whether your data is indeed balanced (e.g. you ask subjects in a pilot experiment to rate your stimuli for plausibility to make sure plausibility is balanced across conditions). Many such rating studies are published (e.g. Gahl & Garnsey, 1997) or available in the form of annotated databases (e.g. the MRC psycholinguistic database contains normed judgments of the "concreteness" and "imageability" of nouns; it also contains frequency information). An alternative to balancing is controlling, which means that you allow the potentially confounding variables to differ across conditions but enter them into the statistical analysis of the results (see the section on Controlling).
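A quick way to check whether a norming variable is balanced across two conditions is to compare the condition means, e.g. with Welch's t statistic (a small |t| is consistent with balanced means). The ratings below are made up for illustration.

```python
# Sketch: checking whether a norming variable (here, hypothetical
# plausibility ratings from a pilot study) is balanced across two
# conditions, using Welch's t statistic.
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se

cond1 = [5.1, 4.8, 5.3, 5.0, 4.9, 5.2]  # made-up plausibility ratings
cond2 = [5.0, 5.2, 4.7, 5.1, 4.9, 5.3]

t = welch_t(cond1, cond2)  # near zero here: the means barely differ
```

In practice you would look up the p-value for t (or let a statistics package do it); the sketch only shows the mean comparison that balancing is about.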


Counter-balancing (Latin-square design)
IN PROGRESS. ask me.
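Until this section is filled in, here is a minimal sketch of the standard Latin-square idea: with n conditions, build n presentation lists such that each list contains every item exactly once (in only one condition), each condition appears equally often within a list, and across lists every item appears in every condition. The function name and the item/condition encoding are my own.

```python
# Sketch of Latin-square counterbalancing: each of the n_conditions lists
# contains every item exactly once, shows each condition equally often,
# and across lists every item appears in every condition.
def latin_square_lists(n_items, n_conditions):
    lists = []
    for lst in range(n_conditions):
        # On list `lst`, item i is shown in condition (i + lst) mod n.
        lists.append([(item, (item + lst) % n_conditions)
                      for item in range(n_items)])
    return lists

# Example: 8 items, 4 conditions -> 4 lists of 8 (item, condition) pairs.
lists = latin_square_lists(8, 4)
```

Each subject is then assigned to one list, so no subject sees the same item twice, yet every item contributes data in every condition.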

Items per condition
IN PROGRESS. ask me.


Web Experimentation

You may find this summary of common problems (and solutions) in web experimentation useful (it is based on Reips plus some of my own thoughts).


Ethical concerns and IRB approval

If you want to run an experiment, and especially if you intend to publish the results, you will have to check whether the experiment needs approval by the Stanford IRB (Institutional Review Board, Human Subjects Research). The department has made an agreement (or, in fact, several agreements) with the IRB that some of the standard experimental methods used by linguists are exempt from IRB approval. Talk to Penny Eckert about this. In case of doubt, or if required, make sure that you get IRB approval for your study. This is a time-consuming process, and the IRB only meets every other month, so make sure to apply early on. You will have to fill out questionnaires about your study and, if it is the first time you register a study, you will have to take an online tutorial on running human subjects research and the ethical constraints/considerations involved.

One of the requirements that the IRB has for running experiments is that you get consent from your participants. This is done via an online form that informs subjects about the risks, payment, duration, and tasks involved in taking the experiment. I strongly recommend asking for informed consent even if you are not required to do so by the IRB. Also, if not required otherwise, I recommend restricting participation in your experiments to adults (whatever the legal age is in the country that your participants will come from).

The IRB will inform you of most things that have to be considered in running (online) experiments, but I have summarized some that I find very important below. Many of these points clearly serve your own interest in addition to being the right thing to do.

  • To avoid unnecessary frustration, inform subjects early on of all the requirements your study has (including technical requirements, age restrictions, native speaker requirements, the requirement to participate in multiple sessions, etc.).
  • Make sure you will be able to pay your participants, and find out what information you will need (many places require participants' social security numbers for reimbursement, or even signed consent forms).
  • Display your contact information clearly on each page of the experiment. Give a phone number and an email address.
  • Display the contact address of the IRB clearly (e.g. on the consent form).
  • If necessary, de-brief your subjects after they take the experiment. Make sure that they will read the de-briefing (e.g. by saying at the end of the experiment that they will get further information about their payment).
  • Avoid offensive language/pictures/etc.
  • Be very clear that participants have the right to abort the experiment at any point. Also inform them (before and after the experiment) that they have a right to ask you to remove and delete their data from your experiment (name a reasonable deadline for that, e.g. 4 weeks). According to the ethical standard of human subjects research, it is extremely important that participation is voluntary at any point during the experiment.
  • Take your own experiment from an off-campus computer. Are the instructions too long or too complicated? Is the task too annoying? Change the experiment =).
