Quantifying the benefits of using decision models with response time and accuracy data
Response time and accuracy are fundamental measures of behavioural science, but participants’ underlying abilities can be masked by speed-accuracy trade-offs (SATOs). SATOs are often inadequately addressed in experiment analyses which focus on a single variable or which involve a suboptimal analytic correction.
Models of decision making, such as the drift diffusion model (DDM), provide a principled account of the decision making process, allowing the recovery of SATO-unconfounded decision parameters from observed behavioural variables.
For plausible parameters of a typical between-groups experiment we simulate experimental data, for both real and null group differences in participants’ ability to discriminate stimuli (represented by differences in the drift rate parameter of the DDM used to generate the simulated data), for both systematic and null SATOs. We then use the DDM to fit the generated data. This allows the direct comparison of the specificity and sensitivity for testing of group differences of different measures (accuracy, reaction time and the drift rate from the model fitting).
Our purpose here is not to make a theoretical innovation in decision modelling, but to use established decision models to demonstrate and quantify the benefits of decision modelling for experimentalists. We show, in terms of reduction of required sample size, how decision modelling can allow dramatically more efficient data collection for set statistical power; we confirm and depict the non-linear speed-accuracy relation; and we show how accuracy can be a more sensitive measure than response time given decision parameters which reasonably reflect a typical experiment.
DDM - Drift diffusion model
SATO - Speed-accuracy trade-off
2.1 Speed accuracy trade-offs
Speed and accuracy of responding are fundamental measures of performance, collected by behavioural scientists across diverse domains in an attempt to track participants’ underlying capacities. As well as being affected by the capacity of participants to respond quickly and accurately, the two measures are also related by participants’ strategic choices of a speed-accuracy trade-off (SATO; for reviews see Wickelgren 1977; Heitz 2014).
The SATO confounds measurement of participant capacity - it means that we cannot directly read either speed or accuracy as an index of participant ability. The SATO is inherent to decision making — it arises whenever we wish to respond as fast and as accurately as possible based on uncertain incoming information.
More accurate responses require more information, which takes longer to accumulate; faster responses forgo collecting additional information at the cost of higher error rates. Importantly, because the SATO is unavoidable it is also necessary that all decision-making processes are positioned with respect to the trade-off.
This does not need to be done deliberately or explicitly, but any decision process can be characterised as adopting some trade-off between speed and accuracy. For the tasks studied by psychologists, it is important to recognise that there will be individual differences, as well as task and group-related differences, in how participants position themselves on the SATO.
Outside of research focused on SATOs explicitly, different practices have been adopted to account for SATOs or potential SATOs in behavioural data. One approach is to ignore either speed or accuracy. For example, ignoring speed of response is common in psychophysics, whereas some domains of cognitive psychology, where high accuracy is assumed, focus only on response times (e.g. Stafford, Ingram, and Gurney 2011)1, albeit sometimes after a cursory check that standard null-hypothesis tests do not reveal significant differences in error rates.
Another approach is to combine speed and accuracy. For example, in the domain of visual search it is common to calculate 'efficiency' scores by dividing search time by search accuracy as a proportion (e.g. Yates and Stafford 2018). Despite being widespread, there is evidence that this practice is unlikely to add clarity to analysis (Bruyer and Brysbaert 2011). We also note that the researchers who initially formulated the efficiency score explicitly counselled against using it in the case of SATOs (Townsend and Ashby 1983).
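As a minimal illustration of the efficiency score just described, the following Python sketch divides mean response time by accuracy expressed as a proportion. The function name and the two example participants are hypothetical, chosen only for illustration, and are not taken from any of the cited studies.

```python
def efficiency_score(mean_rt, prop_correct):
    """Mean response time (seconds) divided by accuracy as a proportion."""
    if not 0.0 < prop_correct <= 1.0:
        raise ValueError("prop_correct must be in the interval (0, 1]")
    return mean_rt / prop_correct


# Hypothetical participants: one fast but error-prone, one slow but careful.
fast_sloppy = efficiency_score(mean_rt=0.60, prop_correct=0.80)
slow_careful = efficiency_score(mean_rt=0.72, prop_correct=0.96)
```

Both hypothetical participants receive a score of about 0.75: the score treats a proportional change in accuracy as exactly offsetting the same proportional change in response time.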
The efficiency score shares the property with other recent suggestions for accounting for SATOs (Davidson and Martin 2013; Seli et al. 2013) that it assumes a linear relation between response time and accuracy. While such approaches may be better than focusing on a single behavioural variable, the assumption of linearity is at odds with work which has explicitly characterised the SATO (Fitts 1966; Wickelgren 1977; Heitz 2014) and has shown a distinctly curvilinear relation between response time and accuracy.
As such, although linear correction methods may work for some portions of the SATO curve, they are likely to be misleading, or at least fail to add clarity, where accuracy and/or speed approaches upper or lower limits of those variables. Recently Liesefeld and Janczyk (2019) showed that several current methods for combining speed and accuracy to correct for SATOs are in fact sensitive to the very SATOs they are designed to account for. These authors advocate the balanced integration score (BIS; Liesefeld, Fu, and Zimmer 2015) as an alternative, but it seems likely that the combination of speed and accuracy remains an estimation problem of some delicacy, especially in the presence of SATOs.
The unprincipled combination of speed and accuracy measures becomes an urgent issue when considered in the context of widespread questions surrounding the reliability of the literature in psychology. Established results fail to replicate, or replicate with substantially reduced effect sizes (Open Science Collaboration 2015; Pashler and Wagenmakers 2012).
Low statistical power has been a persistent problem across many areas of psychology and cognitive neuroscience (Button et al. 2013; Szucs and Ioannidis 2017; Stanley, Carter, and Doucouliagos 2017; Maxwell 2004; Sedlmeier and Gigerenzer 1989; Lovakov and Agadullina 2017), including, but not limited to, research areas which are bound by costly methods or hard-to-reach populations (Geuter et al. 2018; Bezeau and Graves 2001; J. Cohen 1962). This, combined with factors such as analytic flexibility (Simmons, Nelson, and Simonsohn 2011; Silberzahn et al. 2017) — which can only be increased by the lack of a single standard method for accounting for SATOs — has led to a widespread loss of faith in many published results (Ioannidis 2005).
Statistical power is defined with respect to the variability and availability of data, as well as the analysis proposed. For a set experimental design, an obvious candidate for increasing statistical power is to increase sample size, but this is not always easy. Each additional participant costs additional time, money and resources. This is especially true in the case of expensive methods, such as fMRI, or special populations which may be hard to recruit. More sensitive measures also increase statistical power: lower measurement error will tend to reduce variability so that the same mean differences produce larger observed effect sizes.
A motivation for the present work is to demonstrate the practical utility, in terms of increased statistical power, of combining speed and accuracy information in a principled manner using decision models. Such an innovation has the appeal of making the most of data which is normally collected, even if not analysed, whilst not requiring more participants (which is costly), or more trials per participant (which also has costs in terms of participant fatigue which may be especially high for some populations, e.g. children).
2.3 Decision modelling
Models of the decision making process provide the foundation for the principled combination of speed and accuracy data, and thus afford experimenters access to considerable statistical power gains.
Many models exist in which decision making is represented by the accumulation of sensory evidence over time. When the accumulated evidence surpasses some threshold (also called a boundary) then a decision is triggered. The accuracy of the decision depends on which accumulator crosses which boundary, and the speed is given by the time this takes; such models can thus be used to fit speed and accuracy data within the same framework.
A prominent instance of such accumulator models is the so-called drift diffusion model developed by Roger Ratcliff (DDM, Ratcliff 1978; Ratcliff and Rouder 1998). In these models the rate at which evidence is accumulated is represented by the drift rate parameter, which can be thought of as co-determined by the sensitivity of the perceiver and the strength of the stimulus. After a long and successful period of development and application on purely behavioural data, the DDM was at the centre of an important theoretical confluence. Neurophysiologists found evidence for accumulation-like processes in neurons critical to sensory decision making (P. L. Smith and Ratcliff 2004; Gold and Shadlen 2001), whilst theoreticians recognised that accumulator models could be related to statistical methods of uncertain information integration.
Under certain parameterisations many different decision models, all in the family of accumulator models, can be shown to be equivalent to the DDM, and thus in turn equivalent to a statistical method which is optimal for making the fastest decision with a given error rate, or the most accurate decision within a fixed time (Bogacz et al. 2006; Gold and Shadlen 2002).
While debate continues around the exact specification of the decision model which best reflects human decision making, there is a consensus that the DDM captures many essential features of decision processing (but see Pirrone et al. 2018; Pirrone, Stafford, and Marshall 2014; Teodorescu, Moran, and Usher 2016). As might be expected, the DDM has also shown considerable success modelling decision data across many different domains (Ratcliff, Smith, and McKoon 2015; Ratcliff et al. 2016), and in particular at separating out response thresholds from stimulus perception (Ratcliff and McKoon 2008), and in estimating these reliably (Lerche and Voss 2017). In the sense that the DDM implements a statistically optimal algorithm for accumulation of uncertain information, we would expect our neural machinery to implement the same algorithm in the absence of other constraints (Pirrone, Stafford, and Marshall 2014).
The basic mechanism of the DDM is that of a single accumulator, similar to that shown in Figure 2.1, with the following key parameters: v, the drift rate, which reflects the rate of evidence accumulation; a, the boundary separation, which defines the threshold which must be crossed to trigger a decision and so reflects response conservativeness; z, the starting point of accumulation (either equidistant between the two decision thresholds, or closer to one than the other), which biases the response based on pre-stimulus expectations; and Ter, the non-decision time, a fixed delay which does not vary with stimulus information. Additional parameters define noise factors, such as the trial-to-trial variability in drift rate.
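The roles of these parameters can be made concrete with a minimal simulation sketch of a single decision. It is not the simulation code used for the results reported here. It assumes a simple Euler approximation of the diffusion process with a hypothetical time step `dt`, within-trial noise `s` fixed at 1, no trial-to-trial variability, and a starting point `z` expressed as a proportion of the boundary separation; the function name is our own.

```python
import random


def simulate_ddm_trial(v, a, ter=0.3, z=0.5, s=1.0, dt=0.001, rng=None):
    """Simulate one decision; returns (response_time, correct).

    v: drift rate, a: boundary separation, ter: non-decision time (s),
    z: starting point as a proportion of a (0.5 = unbiased, an assumption),
    s: within-trial noise, dt: Euler time step (both assumptions).
    """
    rng = rng or random
    evidence = z * a           # start between the lower (0) and upper (a) boundary
    elapsed = 0.0
    step_sd = s * dt ** 0.5    # noise scales with the square root of the time step
    while 0.0 < evidence < a:
        evidence += v * dt + rng.gauss(0.0, step_sd)
        elapsed += dt
    # Crossing the upper boundary is treated as the correct response.
    return elapsed + ter, evidence >= a


# Usage: 40 trials for one simulated participant with v = 2 and a = 2.
rng = random.Random(2024)
trials = [simulate_ddm_trial(v=2.0, a=2.0, rng=rng) for _ in range(40)]
accuracy = sum(correct for _, correct in trials) / len(trials)
mean_rt = sum(rt for rt, _ in trials) / len(trials)
```

Raising `a` in this sketch makes responses slower and more accurate, whereas raising `v` makes them both faster and more accurate, which is the distinction the model exploits.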
For our purposes, the value of these decision models is that they provide a principled reconciliation of speed and accuracy data. Within this framework these observed behavioural measures reflect the hidden parameters of the decision model, most important of which are the drift rate (reflecting the rate of evidence accumulation) and the decision boundary separation (reflecting the conservativeness of the participant’s decision criterion; higher boundaries produce slower but more accurate responses).
By fitting the DDM to our data we can deconfound the observed behavioural variables (speed and accuracy) and recover the putative generating parameters of the decision (drift and boundary separation). In principle, this allows a more sensitive measure of participant capability (reflected in the drift parameter). Drift is a more sensitive measure because a) it is estimated using both speed and accuracy, b) this estimation takes account of both mean response time and the distribution of response times for correct and error responses, and c) the estimation of the drift parameter is isolated from the effect of different participants’ SATOs (which are reflected in the boundary parameter).
2.4 Prior work
Previous authors have established the principled benefits of this approach (Ratcliff and McKoon 2008). Within a psychophysics framework, Stone (2014) extended Palmer, Huk, and Shadlen (2005)’s decision model to show that response time and accuracy contain different, but possibly overlapping, components of Shannon information about the perceived stimulus.
If these components do not overlap (as suggested by Stone, in preparation) then combining response time and accuracy data should provide better estimates of key parameters which govern the decision process than relying on either response time or accuracy alone. However, our purpose here is not to make a theoretical innovation in decision modelling, but to use established decision models to demonstrate and quantify the benefits of decision modelling for experimentalists.
Previous authors have shown for specific paradigms and decisions that using decision models confers benefits beyond relying on speed, accuracy or some sub-optimal combination of the two, especially in the case of speed-accuracy trade-offs (Zhang and Rowe 2014; Park and Starns 2015). These results use data collected from participants in single experiments. Park and Starns (2015) show that for their data using decision models to estimate a drift parameter allows participant ability to be gauged separately from speed-accuracy trade-offs, and that these estimates consequently have higher predictive value.
Zhang and Rowe (2014) used decision modelling to show that, for their data, it was possible to dissociate behavioural changes due to learning from those due to speed-accuracy trade-offs (revealing the distinct mechanisms of these two processes). In contrast to these studies, our approach is to use simulated data of multiple experiments so as to interrogate the value of decision models across a wide range of possibilities.
Ravenzwaaij, Donkin, and Vandekerckhove (2017, henceforth vRDV) have considerable sympathy with the approach we adopt here. They show that the EZ model, across variations in participant number, trial number and effect size, has higher sensitivity to group differences than the full diffusion model, which they ascribe to its relative simplicity (a striking illustration of the bias/variance trade-off in model fitting; Yarkoni and Westfall 2017).
2.5 Contribution of the current work
Our work extends prior work in a number of ways. Our fundamental comparison is in the sensitivity of model parameters compared to behaviourally observed measures (RT, accuracy). Our purpose is not to compare different ‘measurement models’ (Ravenzwaaij, Donkin, and Vandekerckhove 2017), but to illustrate the benefits for experimentalists of using any decision model over analysing a singular behavioural measure (reaction time or accuracy in isolation).
We use the EZ model, for reasons of computational efficiency, and because prior work has shown that in most circumstances it preserves the benefits of fuller decision modelling approaches. We also confirm that the basic pattern of results holds for other model fitting methods, the HDDM (Wiecki, Sofer, and Frank 2013) and fast-dm (Voss and Voss 2007). We simulate null group effects and so can show false alarm rates as well as calculate results in terms of d’. Our use of d’ allows quantitative comparison and estimation of size of benefit across different speed-accuracy conditions. We explore the combined effects of group shifts in both drift and boundary, and so can show implications of speed-accuracy trade-offs between groups, alongside drift differences.
As with all modelling work, the results we present have always been latent in existing models. Our focus is not on theoretical innovation, but on drawing out the implications of established models in a way that reveals the extent of their value and so promotes their uptake. For a discussion of the contribution of elaborating the consequences of existing models see Stafford (2009, 2010).
Our results are translated into the power-sample size space, which is familiar to experimental psychologists. Our results are accompanied by an interactive data explorer to aid in the translation of the value of decision models into a form most easily comprehensible by experimentalists. For these reasons we hope that the current work can make a contribution in allowing experimentalists with less model fitting experience to readily apprehend the large benefits of model fitting for decision making data.
The broad approach is to consider a simple standard experimental design: a between groups comparison, where each group contains a number of participants who complete a number of decision trials, providing both response time and accuracy data.
We simulate data for true and null differences in drift rate between the groups, as well as true and null differences in boundary between the groups. By varying the number of simulated participants we generate a fixed number of ‘scenarios’ defined by true/null effects in ability (drift) between groups, true/null SATOs (boundary) between groups and experiment sample size. We keep the number of decision trials per participant constant for all these analyses.
For each scenario we simulate many virtual experiments and inspect the behavioural measures to see how sensitive and specific they are to true group differences. We also fit the DDM and estimate the participant drift parameters, similarly asking how sensitive and specific estimates of drift are to true group differences. An overview of the method is illustrated in Figure 3.1.
3.1 Decision modelling
To generate simulated response data, we use the Hierarchical Drift Diffusion Model (HDDM, Wiecki, Sofer, and Frank 2013). The toolbox can also perform model fitting, which uses Bayesian estimation methods to simultaneously fit individual decision parameters and the group distributions from which they are drawn.
While the HDDM offers a principled and comprehensive model fitting approach, it is computationally expensive. An alternative model fitting method, the EZ-DDM (E.-J. Wagenmakers, Van Der Maas, and Grasman 2007) offers a simple approximation, fitting a decision model with a smaller number of parameters, assuming no bias towards either of the two options and no inter-trial variability. This allows an analytic solution which is computationally cheap. Furthermore, the EZ-DDM has been shown to match the full DDM for a range of situations (Ravenzwaaij, Donkin, and Vandekerckhove 2017).
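To indicate why the EZ-DDM is computationally cheap, the sketch below implements the closed-form estimates as we understand them from Wagenmakers, Van Der Maas, and Grasman (2007): drift rate, boundary separation and non-decision time are computed directly from the proportion correct and the variance and mean of response times. It is not the code used for the analyses reported here. The function name is hypothetical, and the scaling parameter `s` is set to 1 as an assumption (the original EZ formulation uses 0.1). Accuracies of exactly 0, 0.5 or 1 need an edge correction, which this sketch simply rejects.

```python
import math


def ez_ddm(prop_correct, rt_var, rt_mean, s=1.0):
    """Closed-form EZ-DDM estimates from three summary statistics.

    prop_correct: proportion of correct responses (not 0, 0.5 or 1)
    rt_var, rt_mean: variance and mean of response times (seconds)
    s: scaling parameter (assumed 1.0 here)
    """
    if prop_correct in (0.0, 0.5, 1.0):
        raise ValueError("apply an edge correction before calling ez_ddm")
    logit = math.log(prop_correct / (1.0 - prop_correct))
    x = logit * (logit * prop_correct ** 2 - logit * prop_correct
                 + prop_correct - 0.5) / rt_var
    v = math.copysign(1.0, prop_correct - 0.5) * s * x ** 0.25  # drift rate
    a = s ** 2 * logit / v                                      # boundary separation
    y = -v * a / s ** 2
    mean_decision_time = (a / (2.0 * v)) * (1.0 - math.exp(y)) / (1.0 + math.exp(y))
    ter = rt_mean - mean_decision_time                          # non-decision time
    return {"v": v, "a": a, "ter": ter}
```

Because each participant requires only a handful of arithmetic operations, fitting many thousands of simulated experiments in this way is far cheaper than iterative or Bayesian estimation.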
For the model fitting presented here (Figures 4.1 - 4.4), we use the EZ-DDM, although initial exploration using both the HDDM and the fast-dm (Voss and Voss 2007, a third model fitting framework) found qualitatively similar results, so our current belief is that these results do not depend on the particular decision model deployed from the broad class of accumulator models2.
Obviously, where we wish to simulate many thousands of independent experiments there are significant speed gains from parallelisation. Parallelisation was done by Mike Croucher, and the code was run on the University of Sheffield High Performance Computing cluster. A sense of the value of parallelisation can be had by noting that the data shown in, for example, Figure 4.4 would have taken around 1 calendar month to generate on a single high performance machine, even though they use the computationally ‘cheap’ EZ-DDM method. Python code for running the simulations, as well as the output data, figures and manuscript preparation files, is available at http://doi.org/10.5281/zenodo.2648995.
Because we are not generating a comprehensive analytic solution for the full DDM we cannot claim that our findings are true for all situations. Our aim is merely to show that, for some reasonable choices of DDM parameters, using decision modelling is a superior approach to analysing response time or accuracy alone, and to quantify the gain in statistical power.
To be able to make this claim of relevance of our simulations to typical psychology experiments we need to be able to justify that our parameter choice is plausible for a typical psychology experiment. In order to establish this we pick parameters which generate response times of the order of 1 second and accuracy of the order 90%. Each participant contributes 40 trials (decisions) to each experiment. Parameters for drift and boundary separation are defined for the group and individual participant values for these parameters are drawn from the group parameters with some level of variability (and, in the case of true effects, a mean difference between the group values, see below for details).
To illustrate this, we show in Figure 3.2 a direct visualisation of the speed-accuracy trade-off, by taking the base parameters we use in our simulated experiments and generating a single participant’s average response time and accuracy, using 1000 different boundary separation values. This shows the effect of varying boundary separation alone, while all other decision parameters are stable.
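The shape of this trade-off can also be sketched analytically. The code below is an illustration under stated assumptions and not the procedure used to generate Figure 3.2: it uses the standard closed-form expressions for accuracy and mean response time of an unbiased diffusion process with no trial-to-trial variability and within-trial noise `s` = 1, with drift 2 and non-decision time 0.3 s. The range of boundary values (`a_min`, `a_max`) is hypothetical, since the range is not stated above.

```python
import math


def ddm_accuracy(v, a, s=1.0):
    """Expected proportion correct for an unbiased diffusion process."""
    return 1.0 / (1.0 + math.exp(-a * v / s ** 2))


def ddm_mean_rt(v, a, ter=0.3, s=1.0):
    """Expected response time: non-decision time plus mean decision time."""
    return ter + (a / (2.0 * v)) * math.tanh(a * v / (2.0 * s ** 2))


def sato_curve(v=2.0, ter=0.3, a_min=0.5, a_max=3.0, n=1000):
    """(mean RT, accuracy) pairs for n boundary values; the range is assumed."""
    step = (a_max - a_min) / (n - 1)
    boundaries = [a_min + i * step for i in range(n)]
    return [(ddm_mean_rt(v, a, ter), ddm_accuracy(v, a)) for a in boundaries]


curve = sato_curve()
```

Along the resulting curve each additional unit of response time buys progressively less accuracy as accuracy approaches its ceiling, so a fixed linear exchange rate between the two measures cannot hold across the whole range.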
3.2.1 Simulating experimental data
For each scenario we simulate a large number of experiments, testing a group (‘A’) of participants against another group (‘B’), with each participant contributing 40 trials. Participant parameters (most importantly the drift rate and boundary parameters) are sampled each time from distributions defined for each of the two simulated experimental groups, A and B.
For the simulations with no true difference in sensitivity between A and B the drift rate of each group has a mean of 2 and within-group standard deviation of 0.05. For the simulations with a true difference in drift group B has a mean of 2+δ, where δ defines an increase in the mean drift rate; the within-group standard deviations remain the same. For the simulations where there is no SATO the mean boundary parameter is 2, with a within-group standard deviation of 0.05. For the simulations where there is a SATO, the boundary parameter of group B has an average of 2−δ, where δ defines the size of the decrease in the mean boundary; the within-group standard deviations remain the same.
All simulations assume a non-decision time of 0.3 seconds, no initial starting bias towards either decision threshold and the inter-trial variability parameters for starting point, drift and non-decision time set to 0. Sample sizes between 10 and 400 participants were tested, moving in steps of 10 participants for sample sizes below 150 and steps of 50 for sample sizes above 150. For each sample size 10,000 simulated experiments were run (each with 40 simulated trials per participant in each of two groups).
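The sampling of participant parameters described above can be sketched as follows. This is an illustration and not the released simulation code: the function names are hypothetical, and drawing individual values from a normal distribution is our assumption, since only the group means and within-group standard deviations are specified.

```python
import random


def sample_group(n, drift_mean, boundary_mean, sd=0.05, rng=None):
    """Draw drift (v) and boundary (a) for n participants.

    Normal distributions are an assumption; only means and SDs are given.
    """
    rng = rng or random.Random()
    return [
        {"v": rng.gauss(drift_mean, sd), "a": rng.gauss(boundary_mean, sd)}
        for _ in range(n)
    ]


def sample_experiment(n_per_group, drift_delta=0.0, boundary_delta=0.0, seed=None):
    """Group A uses the base values; group B has drift 2 + delta, boundary 2 - delta."""
    rng = random.Random(seed)
    group_a = sample_group(n_per_group, 2.0, 2.0, rng=rng)
    group_b = sample_group(n_per_group, 2.0 + drift_delta,
                           2.0 - boundary_delta, rng=rng)
    return group_a, group_b


# One scenario: a true drift difference combined with a SATO, 50 per group.
group_a, group_b = sample_experiment(50, drift_delta=0.1, boundary_delta=0.1, seed=1)
```

Setting both deltas to zero gives the null scenario, in which any significant group difference is by construction a false alarm.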
3.2.2 Effect sizes, observed and declared
The difference between two groups can be expressed in terms of Cohen’s d effect size — the mean difference between the groups standardised by the within group standard deviation. For the observed variables, response time and accuracy, effect sizes can only be observed since these arise from the interaction of the DDM parameters and the DDM model which generates responses.
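For concreteness, Cohen’s d can be computed as in the sketch below. The function names are hypothetical, and standardising by the pooled sample standard deviation is one common convention that we assume here. The second function gives the declared effect size implied by a difference in group means and a shared within-group standard deviation.

```python
import math
import statistics


def cohens_d(group_a, group_b):
    """Observed effect size: mean difference (B minus A) over the pooled SD."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(group_b)
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (statistics.mean(group_b) - statistics.mean(group_a)) / pooled_sd


def declared_d(mean_difference, within_group_sd):
    """Declared effect size for a generating parameter such as drift."""
    return mean_difference / within_group_sd


# A mean difference of 0.1 with a within-group SD of 0.05 is a declared d of 2.
drift_effect = declared_d(0.1, 0.05)
```

Applying `cohens_d` to per-participant mean response times or accuracy rates gives the observed effect sizes, which need not match the declared value.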
For drift rate, the difference between groups is declared (by how we define the group means, see above). The declared group difference in drift rate produces the observed effect size in response time and accuracy (which differ from each other), depending on both the level of noise in each simulated experiment, and the experiment design - particularly on the number of trials per participant. Experiment designs which have a higher number of trials per participant effectively sample the true drift rate more accurately, and so have effect sizes for response time and accuracy which are closer to the ‘true’, declared, effect size in drift rate.
This issue sheds light on why decision modelling is more effective than analysing response time or accuracy alone (because it recovers the generating parameter, drift, which is more sensitive to group differences), and why there are differences in power between measuring response time and accuracy (because these variables show different observed effect sizes when generated by the same true difference in drift rates). Figure 3.3 shows how declared differences in drift translate into observed effect sizes for response time and accuracy.
3.2.3 Hits (power) and False Alarms (alpha)
For each simulated experiment any difference between groups is gauged with a standard two-sample t-test3. Statistical power is the probability of your measure reporting a group difference when there is a true group difference, analogous to the ‘hit rate’ in a signal detection paradigm. Conventional power analysis assumes a standard false positive (alpha) rate of 0.05. For our simulations we can measure the actual false alarm rate, rather than assume it remains at the intended 0.05 rate.
For situations where only the drift differs between two groups we would not expect any significant variations in false alarm rate. However, when considering speed-accuracy trade-off changes between groups (with or without drift rate differences as well) the situation is different. This means that it is possible to get false positives in tests of a difference in drifts between groups because of SATOs. Most obviously, if a SATO means one group prioritises speed over accuracy, analysis of response time alone will mimic an enhanced drift rate, but analysis of accuracy alone will mimic degraded drift rate. Ideally the DDM will be immune to any distortion of estimates of drift rates, but that is what we have set out to demonstrate so we should not assume.
The consequence of this is that it makes sense to calculate the overall sensitivity, accounting for both the false alarm rate, as well as the hit rate. A principled way for combining false alarm and hit rate into a single metric is d’ (“d prime”), which gives an overall sensitivity of the test, much as we would calculate the sensitivity independent of bias for an observer in a psychophysics experiment (Green 1966).
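As a sketch of this calculation, d’ is the difference between the z-transformed hit rate and the z-transformed false alarm rate. The function name is hypothetical. Clipping rates of exactly 0 or 1 by half an observation before transforming is a common convention that we assume here, and it may differ from the treatment in the released code.

```python
from statistics import NormalDist


def d_prime(hit_rate, false_alarm_rate, n_experiments=10_000):
    """Sensitivity index: z(hit rate) minus z(false alarm rate).

    Rates of exactly 0 or 1 are clipped by half an observation (an
    assumption) so that the inverse normal transform stays finite.
    """
    lowest = 0.5 / n_experiments
    highest = 1.0 - lowest
    hits = min(max(hit_rate, lowest), highest)
    false_alarms = min(max(false_alarm_rate, lowest), highest)
    z = NormalDist().inv_cdf
    return z(hits) - z(false_alarms)


# A test with 80% power at the nominal 5% false alarm rate.
nominal = d_prime(0.80, 0.05)
```

A measure whose hit rate merely tracks an inflated false alarm rate gains nothing on this index: equal hit and false alarm rates give a d’ of zero however often the test reports a difference.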
The results shown here support our central claim that decision modelling can have substantial benefits. To explore the interaction of power, sample size, effect size and measure sensitivity we have prepared an interactive data explorer which can be found at https://sheffield-university.shinyapps.io/decision_power/ (Krystalli and Stafford 2019).
4.1 Without Speed-Accuracy Trade-offs
For an idea of the main implications, it is sufficient to plot a slice of the data when the true difference in drift is a Cohen’s d of 2. Recall, from Figure 3.3 above, that although this is a large difference in terms of the generating parameter, drift, this translates into much smaller observed effect sizes in accuracy and response time (approximately 0.3 - 0.4, reflecting ‘small’ to ‘medium’ effect sizes).
Figure 4.1, left, shows how sample size and hit rate interact for the different measures. The results will be depressingly familiar to any experimentalist who has taken power analysis seriously — a sample size far larger than that conventionally recruited is required to reach adequate power levels for small/medium group differences.
From this figure we can read off the number of participants per group required to reach the conventional 80% power level (equivalent to hit rate of 0.8, if we assume a constant false positive rate). For this part of the parameter space, for this size of difference between groups in drift, and no speed-accuracy trade-off, ~140 participants are required to achieve 80% power if the difference between groups is tested on the speed of correct responses only.
If the difference between groups is tested on the accuracy rate only then ~115 participants per group are required. If speed and accuracy are combined using decision modelling, and difference between groups is tested on the recovered drift parameters then we estimate that ~55 participants per group are required for 80% power. An experimentalist who might have otherwise had to recruit 280 (or 230) participants could therefore save herself (and her participants) significant trouble, effort and cost by deploying decision modelling, recruiting half that sample size and still enjoying an increase in statistical power to detect group differences.
Figure 4.1, right, shows the false alarm rate. When the difference in drifts is a Cohen’s d of 0, i.e. no true difference, the t-tests on response time and accuracy both generate false alarm rates at around the standard alpha level of 0.05.
Figure 4.2 shows the measure sensitivity, d’ for each sample size. In effect, this reflects the hit rate (Figure 4.1, left) corrected for fluctuations in false alarm rate (Figure 4.1, right). This correction will be more important when there are systematic variations in false positive rate due to SATOs. Note that the exact value of d’ is sensitive to small fluctuations in the proportions of hits and false alarms observed in the simulations, and hence the d’ curves are visibly kinked despite being derived from the apparently smooth hit and false alarm curves.
4.2 With SATOs
The superiority of parameter recovery via a decision model becomes even more stark if there are systematic speed-accuracy trade-offs. To see this, we re-run the simulations above, but with a shift in the boundary parameter between group A and group B, such that individuals from group B have a lower boundary, and so tend to make faster but less accurate decisions compared to group A. On top of this difference, we simulate different sizes of superiority of drift rate of group B over group A.
For the plots below the drift rate difference is, as above in the non-SATO case, 0.1 (which, given the inter-individual variability, translates into an effect size of 2). The boundary parameter difference is also 0.1, a between-group effect size of 2.
Unlike the case where there are no SATOs, the response time measure is now superior for detecting a group difference over the drift measure; Figure 4.3, left.
This, however, is an artefact of the SATO. If the boundary shift had been in the reverse direction then accuracy, not response time, would appear the superior measure (see below). Once we compare the false positive rate the danger of using a single observed measure becomes clear, Figure 4.3, right.
When using the drift parameter as a measure the SATO between the groups does not induce false alarms. The accuracy measure is insensitive so also doesn’t suffer (but would if the boundary shift was in the opposite direction). The response time measure is catastrophically sensitive to false alarms, approaching 100% false alarm rate with larger samples.
Figure 4.4 shows d’, which combines the hit rate and the false alarm rate, and indicates that the best measure overall is drift rate, as it is in the no-SATO case.
To confirm our intuitions concerning the effect of a raised decision boundary, as opposed to a lowered one, we repeat the simulations with the boundary raised by the same amount as it was lowered for the results shown in Figures 4.3 and 4.4. The results are shown in Figures 4.5 and 4.6. Comparing Figure 4.5 with Figure 4.3 we can see that, with a raised boundary, accuracy appears the superior measure if hits alone are considered (left), but not if false alarms are taken into account (right). With the boundary raised, and hence more conservative responses, response time is less sensitive to group differences. As with the lowered boundary, it is possible to combine hits and false alarms in a single d’ measure (Figure 4.6), which shows the same superiority of the estimated drift measure in comparison to both raw behavioural measures.
5.1 Main conclusions
We have shown the benefits of fitting response time and accuracy data with standard decision models. Such decision models allow the estimation of the generating parameters of simple perceptual decisions, such that the participants’ sensitivity and response conservativeness are deconfounded. This allows more powerful tests of between-group differences for a set sample size, and/or a reduction in the required sample size for a set statistical power. Some insight into why decision modelling brings these benefits can be gained from Figure 3.2. There we show that the speed-accuracy trade-off arises as the decision threshold is shifted, and that it has a non-linear shape. Combining speed and accuracy provides more information than either measure alone, but the combination cannot be done directly; it is best done via an accurate model of the underlying decision processes (such as the DDM).
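The non-linear shape of the trade-off can be illustrated with the closed-form predictions of a simple, unbiased DDM (starting point midway between the boundaries, no across-trial variability). The sketch below is not the simulation code behind Figure 3.2; the function names and all parameter values (drift of 0.2, non-decision time of 0.3 s, scaling parameter s = 0.1) are assumptions chosen only for illustration.

```python
import math


def ddm_predictions(drift, boundary, non_decision=0.3, s=0.1):
    """Expected accuracy and mean RT for a simple unbiased DDM.

    drift must be non-zero; boundary is the boundary separation;
    s is the within-trial noise scaling parameter.
    """
    k = drift * boundary / s ** 2
    accuracy = 1.0 / (1.0 + math.exp(-k))
    mean_decision_time = (boundary / (2.0 * drift)) * math.tanh(k / 2.0)
    return accuracy, non_decision + mean_decision_time


def sato_curve(drift, boundaries):
    """Trace out (boundary, accuracy, mean RT) as the boundary is shifted."""
    return [(a,) + ddm_predictions(drift, a) for a in boundaries]


# Hypothetical boundary values, equally spaced.
for a, acc, rt in sato_curve(0.2, [0.04, 0.08, 0.12, 0.16, 0.20]):
    print(f"boundary={a:.2f}  accuracy={acc:.3f}  mean RT={rt:.3f} s")
```

Under these assumptions, equal steps in the boundary buy progressively smaller gains in accuracy while mean response time keeps rising. This is one way to see why no simple linear combination of the two measures can remove a SATO.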
Inter alia our results show that accuracy can be a more sensitive measure than response time given decision parameters which reasonably reflect a typical experiment. This confirms, in simulation, the result of Ratcliff and McKoon (2008) whose analysis of 18 experimental data sets showed that accuracy better correlated with participant drift rate than response time. Our results also provide some insight into why this is. Figure 3.3 shows that standard between group effect size is more closely matched by generated accuracy than generated response times.
In the presence of systematic shifts in the speed-accuracy trade-off, this approach offers protection against false-positives or false-negatives (in the case that SATOs disguise true differences in sensitivity). Interestingly, under the parameter range used in these simulations, calculation of the d’ sensitivity measure shows that accuracy outperforms response time for SATO in both directions (whether more liberal, Figure 4.4, or more conservative, Figure 4.6).
We do not claim to make a theoretical innovation in decision modelling: the work deploys widely used decision models ‘off the shelf’ and seeks to quantify the extent of the benefit for experimentalists of deploying decision modelling on their behavioural data. The extent of the statistical power gain is considerable. The exact benefit will vary according to the phenomenon and populations investigated, as well as experimental design. For the example design and parameter regime we showcase here, the use of decision modelling allows total sample size to be halved while still increasing statistical power. To explore the relation of sample size and effect size to the sensitivity of behavioural measures, and the decision modelling measures, we provide an interactive data explorer at https://sheffield-university.shinyapps.io/decision_power/ (Krystalli and Stafford 2019).
The results we showcase here and in the data explorer hold only for the parameter regime chosen. We have not analytically proved that parameter recovery with the DDM will always provide a statistical power gain. We have chosen a simple experimental design, with plausible trial numbers per participant and decision parameters which generate realistic values for speed and accuracy of responses. It is possible that for smaller effects, at the boundaries of maximum or minimum speed or accuracy, and/or with higher within- and between-participant noise, decision modelling may not have the benefits depicted here (although it may also have greater benefits than those depicted here).
We have chosen not to explore a within-participants design because the issue of systematically different speed-accuracy trade-offs between conditions seems, prima facie, less likely. For between-groups designs we know of several prominent cases where systematic SATOs confounded conclusions. For example, Pirrone et al. (2017) found that an apparent impairment of perceptual judgement among Autism Spectrum Disorder (ASD) participants could be attributed to a difference in their SATO. The ASD group responded more conservatively, but decision modelling showed they had equivalent sensitivity to the non-ASD group. Ratcliff, Thapar, and McKoon (2006) found an analogous result for young vs. old participants on perceptual discrimination and recognition memory tasks.
We expect the statistical power gains of decision modelling to apply to within-participants designs. All other things being equal, between groups designs have lower statistical power than within-participants designs, so it is for between groups designs - which we assume an experimentalist would only deploy if they had no alternative - that decision modelling brings the greatest gains.
As well as occupying a particular point in the parameter space of decision models, our results are also generated using a particular model and model fitting approach (the EZ-DDM; Wagenmakers, Van Der Maas, and Grasman 2007), although we have verified that the same qualitative pattern can be produced by alternative approaches (Wiecki, Sofer, and Frank 2013; Voss and Voss 2007). Additionally, it is worth noting that for some parameterisations several prominent decision models are equivalent (Bogacz et al. 2006).
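For readers unfamiliar with the EZ-DDM, the sketch below implements the closed-form EZ equations as we understand them from Wagenmakers, Van Der Maas, and Grasman (2007): drift rate, boundary separation and non-decision time are recovered from the proportion correct, the variance of response times and the mean response time. It is an illustration, not the fitting code used for our simulations. The function and argument names are ours, the scaling parameter s = 0.1 is the conventional default, and the input values in the usage line are illustrative rather than results from this paper.

```python
import math


def ez_ddm(prop_correct, rt_variance, rt_mean, s=0.1):
    """Closed-form EZ-diffusion estimates (drift, boundary, non-decision time).

    RT statistics are in seconds. Proportions of exactly 0, 0.5 or 1 need
    an edge correction before use, which this sketch does not apply.
    """
    if prop_correct in (0.0, 0.5, 1.0):
        raise ValueError("apply an edge correction to prop_correct first")
    logit = math.log(prop_correct / (1.0 - prop_correct))
    x = logit * (logit * prop_correct ** 2 - logit * prop_correct
                 + prop_correct - 0.5) / rt_variance
    # Drift is negative when performance is below chance.
    drift = math.copysign(1.0, prop_correct - 0.5) * s * x ** 0.25
    boundary = s ** 2 * logit / drift
    y = -drift * boundary / s ** 2
    mean_decision_time = (boundary / (2.0 * drift)) \
        * (1.0 - math.exp(y)) / (1.0 + math.exp(y))
    non_decision = rt_mean - mean_decision_time
    return drift, boundary, non_decision


# Illustrative inputs: 80.2% correct, RT variance 0.112 s^2, mean RT 0.723 s.
drift, boundary, non_decision = ez_ddm(0.802, 0.112, 0.723)
```

With these inputs the estimates come out at approximately 0.10 for drift, 0.14 for boundary separation and 0.30 s for non-decision time. Because the estimates are closed-form, fitting is essentially instantaneous, which is convenient when very many simulated data sets have to be fitted.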
A recent collaborative analysis project found that, despite a large diversity of fitting methods, common inferences were made across different decision models (Dutilh et al. 2016). A reasonable conclusion from this project was that in many circumstances the simple models should be preferred (Lerche and Voss 2016). Ratcliff and Childers (2015) claim that hierarchical Bayesian methods of fitting, as used by the HDDM, are best, at least for individual difference investigations (although see Jones and Dzhafarov 2014, who claim that many variants of the DDM cannot be successfully distinguished by empirical measurement).
Although we have not verified this, we expect to obtain similar results with many established models of decision making, e.g. the LBA (Brown and Heathcote 2008) or the LCA (Usher and McClelland 2001). We have no reason to suspect that our results depend on the specific decision making model used; rather, they depend on the established ability of a wide class of decision models to capture the regularities in behavioural data from human decisions.
5.3 Wider context
As well as power gains, and protection against SATO confounds, decision modelling has other benefits to offer the experimentalist. It allows differences between participants or groups to be localized to particular components of the decision process. Decision modelling, since it relies on the distribution of responses rather than just the means, can also reveal underlying differences when single variables (e.g. response time) are stable (White et al. 2010).
There is a growing awareness of the limitations of studying speed or accuracy alone (Oppenheim 2017). A recent meta-analysis confirms a low correlation between speed and accuracy costs in psychological tasks (Hedge et al. in press). Vandierendonck (2017) compares seven transformations which combine reaction time and accuracy, without use of decision modelling, but finds none unambiguously superior either to the others or to inspecting raw reaction time and accuracy.
5.4 Related work
A recent paper (Hedge, Powell, and Sumner 2018) used a comparable simulation based approach and reached a similar conclusion to ours — that model-free transformations of reaction time and accuracy, even if hallowed by common usage, are outperformed by a model-based transformation which assumes a sequential sampling model like the DDM.
White, Servant, and Logan (2018) also present a parameter recovery account, but compare different variations of the sequential sampling models which are designed to account for decisions under conflict. Their focus is on comparing different decision models rather than model-free and model-based transformations of reaction time and accuracy.
Baker et al. (2019) used the simulation method to address a question of equal importance to experimentalists — how does the number of trials interact with sample size to affect statistical power? Like us they present an interactive demonstration of their findings: https://shiny.york.ac.uk/powercontours/
5.5 Getting started with decision modelling
Those who wish to apply decision models to their data have a range of tutorials and introductory reviews available (Voss, Nagler, and Lerche 2013; Forstmann, Ratcliff, and Wagenmakers 2016), as well as statistical computing packages which support model fitting (Wiecki, Sofer, and Frank 2013; Voss and Voss 2007). Although analysing speed and accuracy data with decision models incurs a technical overhead, we hope we have made clear the considerable gains in both enhanced sensitivity to true differences and protection against spurious findings that it affords.
Decision modelling offers large benefits to the experimentalist, and is based on a principled framework which has seen substantial validation and exploration. No analysis plan can rescue an ill-conceived study, and experimentalists have many other considerations which can enhance statistical power before they attempt decision modelling (Lazic 2018).
Our attempt here has just been to illustrate how, in cases where speed and accuracy are collected from two groups of participants, decision modelling offers considerable power gains, and the attendant increased chances of discovering a true effect and/or reduction of required sample size, without increased risk of false positives. The contribution this paper hopes to make concerns the size of these benefits. These are not just, as could be theoretically shown, non-zero, but they are, under conditions which it is realistic to expect to hold for a typical experiment, consequential and so worthy of the experimentalist’s consideration.
Thanks to Jim Stone and Amber Copeland for discussion of the project and reading the manuscript. The work benefited from feedback, online and offline, following presentation at the International meeting of the Psychonomics Society in Amsterdam, May 2018.
Code for all simulations reported here and the generated data is available in public repositories (see links above).
Note that we choose to cite work by the lead author here for illustration, rather than highlight any other researchers for their use of these suboptimal practices.
Computational constraints mean that systematically confirming this by fully exploring the parameter space presented in this manuscript must be future work.
Note, for high accuracy values t-tests may not be appropriate (they are strictly not applicable to proportions anyway, but this may become a real issue for values very close to 1 or 0).