Visual Functions Laboratory,
Department of Optometry and Vision Sciences,
The University of Melbourne,
Parkville, Vic., 3052
Australia
Purpose: To ascertain the accuracy and precision of the catch trial monitor used to estimate the false response rate in automated perimetry.
Methods: False responses were automatically injected at various rates by modified perimetric software while reliable perimetric subjects underwent visual field thresholding. Four repeat tests were conducted within a one-hour period to quantify the variability and bias inherent in the catch trial technique.
Results: The catch trial method gives a very accurate estimate of the average false response rate. Precision, however, is quite poor due to the small sample of catch trials. Outcomes are predicted using a binomial model and we demonstrate good concordance between the model and empirical data. When the true false response rate is 33%, estimates derived from catch trials ranged from 7-57%.
Conclusions: Although the catch trial method gives an accurate estimate of the false response rate, confidence intervals are too wide to provide a high level of precision. Our data suggest that tests with reported false response rates of <20% may be considered reliable.
Subject reliability is important for any psychophysical test, as reliable subjects are more likely to yield accurate estimates of threshold. Clinical automated perimetry is a good example of a testing paradigm where reliable responses are important, since thresholding is usually conducted using a small number of stimulus presentations in the interests of minimizing test time. The two types of false response that can affect threshold outcomes are false positives and false negatives.
In clinical perimetry, catch trials are commonly used to monitor the false positive and false negative rates.1 With catch trial methodology, responses made during a pause in stimulus presentation (with or without mechanized sounds, depending on the method of stimulus delivery) give an estimate of the false positive rate. Conversely, a failure to respond to stimuli that are substantially 'brighter' than a previously established threshold is interpreted as a false negative error. Other methods for estimating response reliability have been suggested, such as evaluating inconsistent responses2 and using maximum likelihood procedures,3 but these alternatives are not widely used in clinical perimetry and will not be considered further in this paper.
In evaluating reliability indices, our previous work4 has shown that high false positive rates can always be taken to indicate poor reliability. However, interpretation of a high false negative rate is more complicated. It can indicate an unreliable patient or it can be elevated in association with a scotoma and unstable fixation.4 Scotomatous regions may give rise to high false negative rates due to their increased short-term fluctuation5 or due to small amounts of fixational instability moving the stimulus across a scotomatous border.4 The effect that false responses have on field data is the subject of a companion paper.6
Although some of these indices might be elevated in the presence of a scotoma,4 they are the clinician's only guides to patient reliability. Test outcomes can be rendered unreliable by a 'trigger happy' patient who makes many false positive errors or by an inattentive patient who generates many false negative responses. It is known that unreliable patients generate test results that are more variable7 and harder to interpret than those of reliable patients, and it is also accepted that diagnostic and therapeutic decisions should be deferred in the presence of unreliable test results. The clinical challenge, then, is to determine which tests are unreliable. Having accurate and precise indices of performance can assist this process.
By definition, accuracy describes how closely an estimator of a variable approximates its underlying population mean value, bearing in mind that all measures are estimates of an underlying distribution or population. If multiple estimates are made then a sample mean can be established. An accurate estimator would be one where the sample mean was very close to the population mean, showing little or no bias. Accuracy represents the absolute difference between the sample mean and the population mean. A closely related term in testing is validity, which describes how well a test measures what it purports to measure.
The term precision describes how much spread exists when multiple estimates or samples are made. A precise estimator will generate closely grouped data when multiple samples are gathered over time. Precision is related to the variance or standard deviation seen in a sample distribution and can be thought of as being most similar to the commonly used expressions 'variability' or 'repeatability'.
An estimator may be very accurate and yet display poor precision. In such a situation, the sample mean would be very close to the population mean but the sample standard deviation would be high. Conversely, accuracy can be poor with good precision where the sample mean is distant from the population mean (a bias) but estimates are grouped tightly.
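The distinction between accuracy and precision can be made concrete with a short calculation. The following Python sketch is illustrative only: the function name `accuracy_and_precision` and both lists of estimates are hypothetical and are not data from this study.

```python
from statistics import mean, stdev

def accuracy_and_precision(estimates, true_value):
    """Return (bias, spread) for repeated estimates of a known true value.

    bias   : sample mean minus true value (accuracy; 0 means unbiased)
    spread : sample standard deviation (precision; smaller is more precise)
    """
    return mean(estimates) - true_value, stdev(estimates)

# Hypothetical estimates (%) of a true rate of 33%; illustrative values only.
accurate_but_imprecise = [7, 21, 29, 36, 43, 57]
precise_but_biased = [44, 45, 46, 45, 44, 46]

for label, data in [("accurate, imprecise", accurate_but_imprecise),
                    ("precise, biased", precise_but_biased)]:
    bias, spread = accuracy_and_precision(data, 33)
    print(f"{label}: bias = {bias:+.1f}, SD = {spread:.1f}")
```

The first list has a sample mean close to the true value but a large standard deviation, whereas the second is tightly grouped around a value that is well removed from the true rate.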
The accuracy and precision of perimetric reliability indices are affected by how these estimates are collected, which in this investigation is via the catch trial monitor. Catch trial monitors interleave false response catch trials with threshold (staircase) directed trials. Since most catch trial monitors use a limited number of presentations (3-5% of total presentations8, 9) in order to reduce test duration, sampling theory predicts that false response estimates may be imprecise representations of the underlying error rates. The small number of catch trial stimuli may undersample the true false response distribution and so underestimate the error rate. Furthermore, as sampling occurs in a discrete, quantal manner, with poor resolution in time, overestimation can also occur. Both outcomes are undesirable: overestimation leads to unnecessary repeat testing, whereas underestimation can lead to an unreliable test being accepted as a basis for clinical decision making.
The following study was conducted to determine how accurately and precisely the catch trial method detects false responses. This was accomplished by artificially injecting false responses into the perimeter's thresholding routine while reliable perimetric subjects were tested.
The study was approved by our institutional ethics committee and informed consent was obtained from all subjects prior to participation.
The Medmont M600 perimeter was used in this experiment; an earlier model has been described in detail elsewhere.9 In short, this bowl perimeter uses green LEDs as stimuli (λmax = 565 nm, 20 nm half bandwidth, maximum intensity at 0 dB of 1000 asb = 318 cd/m²) covered by a diffusing surface to eliminate the 'black hole' effect present in some LED perimeters. Background luminance is 10 asb (3.18 cd/m²). Thresholds are obtained using a 6/3-dB staircase with two reversals at the 3-dB step size used to establish threshold. Full thresholding with the 'Central 100' pattern (<30°, 99 stimuli + one blind-spot monitor) was performed.

Figure 1. A schematic representation of the algorithm used to 'inject' false responses during testing. False positives were only injected if the subject did not respond to the stimulus and false negatives were only injected if the subject did respond to the stimulus.
Modified thresholding software was developed for this experiment. This worked by intercepting and modifying (if needed) the subject's response prior to it being returned to the threshold module of the program, as shown schematically in Figure 1. With this modification, we could inject false responses at any specified integer rate from 0-100% during the testing process. An injected false positive returned a 'seen' category whenever the subject did not press the response button, whereas an injected false negative returned a 'not seen' category in the presence of a button press. False positives were only injected when the subject registered 'not seen' responses and false negatives were only injected when 'seen' responses were registered. For example, if the false response generator was set to 10% false positives, any stimulus that was not seen by the subject had a probability of 0.1 of being returned as 'seen' to the thresholding module. Likewise, at 33% false negative injection, 'seen' responses would have a probability of 0.33 of being returned as 'not seen'. Thresholding decisions (step sizes, criterion reversals, confirmations, etc.) were not altered, nor was the number of false response catch trials. Responses to catch trials were subject to injection in the same way, which allowed the accuracy of this method of false response detection to be quantified.
Responses to fixation monitoring trials (Heijl-Krakau) were not altered by the injection algorithm. This measure was employed to ensure that information regarding fixational reliability was not corrupted by the injected false response rate. Nevertheless, the fixation loss index is confounded with the subject's false positive rate, as it is impossible to separate these factors. Hence, the fixation loss index represents a combined contribution of real fixation loss and the underlying false positive rate. This index can be used to ensure that our subjects were actually responding in a reliable manner insofar as fixation and false positives are concerned.
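The injection rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions and not the modified perimeter software itself: the function name `inject_false_response`, its parameters and the use of a pseudo-random number generator are all hypothetical.

```python
import random

def inject_false_response(seen, fp_rate=0.0, fn_rate=0.0, rng=random):
    """Return the response that is passed on to the thresholding module.

    seen    : True if the subject pressed the button, otherwise False
    fp_rate : probability that a 'not seen' response is returned as 'seen'
    fn_rate : probability that a 'seen' response is returned as 'not seen'
    rng     : object with a random() method (assumed; defaults to `random`)
    """
    if not seen and rng.random() < fp_rate:
        return True    # injected false positive
    if seen and rng.random() < fn_rate:
        return False   # injected false negative
    return seen        # response passed through unchanged

# Example: 10% false positive injection applied to 10,000 'not seen' responses.
rng = random.Random(0)
returned = [inject_false_response(False, fp_rate=0.10, rng=rng) for _ in range(10_000)]
print(f"Proportion returned as 'seen': {sum(returned) / len(returned):.3f}")
```

Because false positives are only injected on 'not seen' responses and false negatives only on 'seen' responses, the two rates act on separate sets of trials in this sketch.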
Three young males (ages 23 to 26) with normal ocular and systemic health were used as subjects. Habitual distance optical correction was worn by subjects as needed without near addition. The subjects were all highly practiced with automated perimetry and were reliable observers making few false responses (range 0-6% recorded). Two of the subjects were naive to the purpose of the study; their instructions were to try to complete the perimetric tests as reliably as possible, making no false responses or fixation losses.
A single session consisted of four threshold tests with each lasting 11-13 mins. One of several levels of false response was injected within a single session. One observer was tested extensively with six false response rates (10, 20, 25, 33, 42 and 50%) whereas the other two had a limited number of rates applied (10, 33 and 50%). All observers performed testing with the 0% false response condition in order to quantify their individual level of reliability. False response rates were presented in a randomized order. A complete examination of all false response rates (6 false positive rates + 6 false negative rates + 0% control) required multiple 1-hr test sessions. The primary subject completed these 13 sessions over a five-day period whereas the other two subjects performed their limited set of seven sessions (3 false positive rates + 3 false negative rates + 0% control) over a one-month period.
This study depends on the assumption that test subjects were reliable and that only the injected false responses influenced the performance of the catch trial monitor. This was confirmed by three findings. In the control condition (0% false response injection) all observers were found to be very reliable (mean false responses ±1 std.dev., S1=0±0%, S2=0±0% and S3=1.9±3.5%). Fixation was also very stable (mean ±1 std.dev., S1=2.4±1.7%, S2=1.9±1.7% and S3=2.9±1.9%). As previously noted, the reported fixation loss rate represents a combination of the fixation loss rate and the underlying false positive rate; the low values (approximately 2%) returned by the perimeter are an indirect confirmation of subject reliability.
Another finding that would have invalidated our assumption of subject reliability is evidence of a fatigue effect. We sought this effect by examining test sensitivity as a session progressed. Of the 27 sessions examined, four sessions were identified where a significant effect of test order was observed (ANOVA). Of these four sessions, only one showed a trend for worsening sensitivity as the session progressed consistent with a fatigue effect. The other three sessions showed a tendency for sensitivity to improve during the session or no particular trend (for example the 2nd test could have had slightly lower sensitivity than the 1st but the 3rd and 4th tests did not follow the trend).
Given these confirmations of the assumption of subject reliability, it was then possible to consider the accuracy and precision of the catch trial monitor under the influence of the injected false responses alone.
Only 'same effect' analyses were performed; that is, false negative catch trial estimates were examined when false negatives were injected and false positive catch trial estimates were examined when false positives were injected. Statistical analysis showed that catch trial estimates derived from false negative and false positive injection were not significantly different (paired t-test; t=1.63, df=47, p=0.11). Consequently, all tests with the same injected false response rate (regardless of type) have been combined and averaged in the following analysis.

Figure 2. The percentage of 'hits' reported by the catch trial monitor versus the injected rate of false responses. Filled symbols show the average of three subjects, unfilled symbols show data from the extensively tested subject. Error bars are ±1 standard deviation. The unfilled symbols have been shifted along the abscissa (2%) for clarity. The solid diagonal line shows perfect correspondence and the dashed line is the best fitting linear regression to the filled symbols.
Figure 2 gives an indication of both the accuracy and precision of the catch trial monitor. In this figure, the injected false response rate is plotted along the abscissa with the detected false response rate along the ordinate. Filled symbols give averages (±1 std.dev.) for the three subjects whereas open symbols show averages (±1 std.dev.) for the subject who was extensively tested. If there were ideal correspondence between catch trial estimates and the injected false response rate the data would be expected to fall along a straight line passing through (0, 0) with unit slope. The dashed line depicts this relationship. For comparison, the continuous line is the best fitting linear function to the average data for the three observers (filled symbols). The accuracy of the average false response estimate is clear, the dashed and solid lines being almost coincident. The length of the error bars confirms the poor precision.
It can be concluded from the near collinearity of the ideal (dashed) and fitted (solid) lines in Figure 2 that the average false response rate estimated by the catch trial monitor is an accurate representation of the true (injected) rate. However, the spread depicted by the error bars reveals great imprecision in this estimate. Our data show that when 33% false responses were injected (24 such tests were performed) the estimated false response rate ranged from 7 to 57%. Such variability can limit a clinician's capacity to identify the true level of false response and thus hamper determination of how reliably a test has been performed. The reason for this imprecision is considered below.
The behaviour of the catch trial monitor can be predicted by modeling the responses assuming independence between events and that a binomial distribution describes these discrete events. If the number of catch trials is known, along with the true false response rate, then the probability of returning x catch trial hits, P(x), can be calculated using the binomial distribution as shown in Equation 1.
$$P(x) = \frac{N!}{x!\,(N-x)!}\; p^{x}\, q^{N-x} \qquad \text{(Equation 1)}$$
where
N = the number of catch trials
x = the number of false response hits
p = the probability of a hit (the injected false response rate in this case)
q = the probability of 'not a hit' (1-p) and,
! = the factorial function
Figure 3 shows the predictions of this model. The solid lines in Figure 3 depict the boundary containing the 5th and 95th percentiles from Equation 1. The group average data (filled symbols) from Figure 2 are repeated for comparison. For these calculations, N was taken as the modal number of catch trials presented across all tests in this study, found to be 14. This plot confirms our empirical findings. It shows that, on average, the catch trial monitor should return accurate findings. However, it also predicts that monitor precision will be poor, exhibiting a high degree of variability about the average estimate. This variability is stochastic in nature and determined by the statistics governing the detection of discrete events with small samples. In actuality, there is no 5th and 95th percentile in a binomial distribution based on 14 samples and the contours of Figure 3 show the next realizable percentile that contains the 5th (lower curve) and 95th (upper curve) percentiles.

Figure 3. Results from binomial modeling (using Equation 1) of catch trial events. The diagonal line shows perfect linear correspondence between injected and estimated rates and the contours show the 5th (lower curve) and 95th (upper curve) percentiles for the model. Details regarding the derivation of this relationship have been given in the text. The symbols and error bars redraw the group average data of Figure 2.
The empirical data of Figure 2 show that when the injected false response rate was 33%, the estimated level ranged between 7 and 57%. This corresponds closely to the lower and upper confidence limits established using Equation 1, as displayed in Figure 3. Equation 1 also suggests that it is extremely unlikely (p<0.01) that not a single false response will be recorded from 14 samples if the true rate is 33%.
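The binomial predictions discussed above can be checked numerically. The Python sketch below evaluates Equation 1 for N = 14 catch trials and a true rate of 33%. The function names are hypothetical, and `outer_bounds` encodes one possible reading of the 'next realizable percentile' convention (the largest hit count whose cumulative probability does not exceed 5%, and the smallest whose cumulative probability reaches 95%); this reading is an assumption and may differ from the procedure used to draw Figure 3.

```python
from math import comb

def binomial_pmf(x, n, p):
    """Equation 1: probability of exactly x hits in n catch trials."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def binomial_cdf(x, n, p):
    """Probability of x or fewer hits in n catch trials."""
    return sum(binomial_pmf(k, n, p) for k in range(x + 1))

def outer_bounds(n, p, lower=0.05, upper=0.95):
    """Hit counts that bracket the requested percentiles from the outside."""
    lo = max((x for x in range(n + 1) if binomial_cdf(x, n, p) <= lower), default=0)
    hi = min(x for x in range(n + 1) if binomial_cdf(x, n, p) >= upper)
    return lo, hi

N = 14         # modal number of catch trials per test
p_true = 0.33  # injected false response rate

lo, hi = outer_bounds(N, p_true)
print(f"P(no hits) = {binomial_pmf(0, N, p_true):.4f}")
print(f"Bounds: {lo} to {hi} hits ({100 * lo / N:.0f}% to {100 * hi / N:.0f}%)")
```

Under these assumptions the bounds fall at 1 and 8 hits out of 14, that is, estimates of about 7% and 57%, and the probability of recording no hits at all is below 0.01.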
The results reported in the empirical study are supported by the modeling, and are cause for some concern. There is evidence that even low false response rates can produce detrimental effects on perimetric thresholds and can significantly alter the summary indices that are used for diagnosis and monitoring of progression.6, 7, 10-11 Since false responses can produce changes in perimetric outcomes it is desirable that accurate and precise estimates of the underlying false response rate be available. However, the data and modeling would suggest that information provided by catch trial monitors is far from precise.
The practical significance of this finding is shown in Table 1, where the predictive capacity of the false response catch trial monitor is outlined. In this table, false response rate estimates returned by the perimeter are given in the left column. Subsequent columns report how high the true rate of false response could be for a given proportion of tests. Modeling predicts that if clinicians wish to accept only those tests that are reliable for decision-making, they must set a low criterion for rejection based on monitor outcomes (Table 1). For example, any test that returns a 14% estimate from the false response monitor (third entry, left column, Table 1) could actually harbour a true false response rate that is much higher, namely 36% (1 in 20 tests), 32% (1 in 10 tests), 26% (1 in 5 tests) or 24% (1 in 4 tests). These calculations show that any test that returns a false response rate of >21% must be considered unreliable, as the confidence limits suggest that the true false response rate could be higher than 33% on at least one in five occasions. In a clinical context where a true false response rate of 33% is commonly deemed the upper limit of acceptability, using an estimate of <20% as the cut-off criterion will provide a reasonable balance between high rejection rates of reliable tests and the possibility of unreliable performance going undetected. Calculations indicate that use of this criterion (a catch trial estimate <20%) will enable most cases of unreliable performance (about four out of five tests with an underlying false response rate >33%) to be identified.
Table 1. A guide that can be used to predict test reliability for a given estimated false response rate returned by the perimeter (left column). The true false response rate will be at least as high as the tabulated values as often as is indicated at the top of the column. More details are given in the text.
Patient testing in a clinical setting presents additional problems. It is well accepted that patients display a 'fatigue effect' in automated perimetry as a test progresses, particularly those patients with visual field abnormality.12, 13 Not only does sensitivity decline but patients will also tend to become less reliable toward the end of a test.12, 14 In light of this fatigue effect, the type of false negative catch trial takes on greater significance. Some perimeters present false negative catch trials at maximum intensity while others use a stimulus that is a fixed dB interval above the previously established threshold. If a patient displays a significant fatigue effect, a fixed dB increment false negative catch trial may only be slightly suprathreshold (or possibly even sub-threshold) later in the test. This may be a case for the use of maximum intensity false negative catch trials in clinical perimetry. In addition, the percentage of stimuli devoted to catch trials is device dependent. The binomial modeling performed in this study used the modal number of catch trials per test performed by the Medmont perimeter (as previously mentioned, this was found to be 14). Fewer catch trials, or more correctly, a lower percentage of stimuli devoted to catch trials, will lead to worse precision than reported here. Patient factors (such as fatigue and a higher rate of false negatives in diseased eyes) and device factors (such as the percentage of stimuli devoted to catch trials and the relationship between catch trial intensity and threshold) may mean that the conclusions drawn in this article underestimate the limitations of catch trial monitoring in clinical automated perimetry.
Naturally, it is up to the individual clinician to bring their clinical judgement to bear when deciding on a cut-off criterion for reliability. This criterion is not fixed at a particular value, but should be modified in the light of other information. The suggested cut-off (<20%) to denote 'reliability' made in this paper is intended as a guide only.
It appears that the catch trial method of monitoring false responses using small samples is imprecise, but what would be the 'price' of improving precision? The obvious cost would be prolonged test duration. Currently, approximately 6% of a test's duration is attributable to false response catch trials. If we were to halve the variability in catch trial estimation (with a corresponding improvement in precision) it would be necessary to increase the number of catch trials by a factor of four. This alteration would increase test duration by about 18%, raising the catch trial contribution to test duration to just over 20%. At a time when the push in clinical perimetry is toward minimizing test duration, it would seem somewhat farcical to suggest increasing the number of catch trials. What would be the benefit if this change were made? A more precise estimate of the underlying false response rate without doubt, but of what clinical significance would this be? This depends on the effect that false responses have on perimetric outcomes. If the effect of false responses changes only slowly as the rate increases, then mis-estimation may not be important. If, on the other hand, the impact of false responses changes quite rapidly with an increasing error rate, then mis-estimation could be very detrimental to clinical decision-making. The effect of false responses on perimetric outcomes is the topic of a companion paper.6
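The arithmetic behind these figures can be reproduced with a short Python sketch. It assumes that the sampling error of the estimate follows the binomial model, so that its standard deviation scales with 1/√N, and that test duration is proportional to the number of presentations; the function names are hypothetical.

```python
from math import sqrt

def binomial_sd(p, n):
    """Standard deviation of a rate estimated from n catch trials."""
    return sqrt(p * (1 - p) / n)

def after_scaling(current_share, factor):
    """Effect of multiplying the number of catch trials by `factor`.

    current_share : fraction of test duration currently taken by catch trials
    Returns (relative test duration, new catch trial share of duration).
    """
    catch = current_share * factor
    total = catch + (1 - current_share)
    return total, catch / total

# Quadrupling the catch trials (14 -> 56) halves the standard deviation.
ratio = binomial_sd(0.33, 56) / binomial_sd(0.33, 14)
total, share = after_scaling(0.06, 4)
print(f"SD ratio = {ratio:.2f}")
print(f"Test duration x{total:.2f}, catch trial share = {100 * share:.1f}%")
```

With a current catch trial share of 6%, a fourfold increase gives a relative test duration of 1.18 and a catch trial share of 0.24/1.18, in line with the figures quoted above.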
Catch trials generate accurate, yet imprecise, estimates of the true level of false responses occurring during automated perimetry. The imprecision is stochastic in nature and arises from the small sample sizes used to estimate these parameters. Empirical data can be modeled using binomial statistics. Since catch trial estimates provide, in most situations, the clinician's only means for assessing patient reliability, we suggest that a criterion of >20% false responses be used to flag unreliable tests. We conclude that more precise methods of false response estimation deserve investigation, as do the effects of false responses on perimetric outcomes.