Statistical Background Information

Reliability and Validity


Reliability, in the context of measurement quality, refers to the "consistency" or "repeatability" of measurements. It is the extent to which the results of a test remain consistent over repeated administrations to the same participant under comparable conditions. If a test yields consistent results for the same measure, it is reliable; if repeated measurements produce different results, it is not. If, for example, an IQ test yields a score of 90 for an individual today and 125 a week later, it is not reliable. The concept of reliability is illustrated in Figure 1. Each point represents an individual. The x-axis represents the test results of the first measurement and the y-axis the scores of the second measurement with the same test. Figures 1a-c represent tests of different reliability. The test in Figure 1a is not reliable: the score a participant achieved in the first measurement does not correspond at all with the score of the second measurement.

The reliability coefficient can be calculated as the correlation between the two measurements. In Figure 1a the correlation is near zero, i.e. r = 0.05 (the theoretical maximum is 1). The test in Figure 1b is somewhat more reliable: the correlation between the two measurements is 0.50. Figure 1c shows a highly reliable test with a correlation of 0.95.

Figure 1: Illustration of different correlation coefficients. Left: r = 0.05, middle: r = 0.50, right: r = 0.95.
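The reliability coefficient described above is simply a Pearson correlation between two administrations of the same test. A minimal sketch in plain Python (the five screener scores are hypothetical):

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two lists of test scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical scores of five screeners on two measurement dates
first = [62, 70, 55, 80, 66]
second = [60, 72, 58, 78, 65]
reliability = pearson_r(first, second)   # close to 1: a reliable test
```

With real data, a library routine such as `scipy.stats.pearsonr` would normally be used instead.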

The reliability of a test may be estimated by a variety of methods. When the same test is repeated (usually after a time interval during which job performance is assumed not to have changed), the correlation between the scores achieved on the two measurement dates can be calculated. This measure is called test-retest reliability. A more common method is to calculate the split-half reliability. With this method, the test is divided into two halves. The whole test is administered to a sample of participants and the total score for each half is calculated. The split-half reliability is the correlation between the scores obtained on the two halves. In the alternate forms method, two tests are created that are equivalent in terms of content, response processes, and statistical characteristics. Using this method, participants take both tests and the correlation between the two scores is calculated (alternate forms reliability). Reliability can also be a measure of a test's internal consistency. In this case, the reliability of the test is judged by estimating how well the items that reflect the same construct or ability yield similar results. The most common index for estimating internal consistency is Cronbach's alpha, which is often interpreted as the mean of all possible split-half estimates. Another internal consistency measure is KR-20 (for details see the handbooks mentioned above).
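Cronbach's alpha can be computed directly from the item-level scores via the standard formula alpha = k/(k−1) · (1 − Σ item variances / variance of totals). A sketch, assuming population variances and a complete score matrix (one list of participant scores per item; the data would be real test responses in practice):

```python
def cronbach_alpha(items):
    """Cronbach's alpha; `items` holds one list per test item,
    each inner list containing one score per participant."""
    k = len(items)                       # number of items
    n = len(items[0])                    # number of participants

    def var(xs):                         # population variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    # total test score of each participant across all items
    totals = [sum(item[p] for item in items) for p in range(n)]
    return k / (k - 1) * (1 - sum(var(it) for it in items) / var(totals))
```

Perfectly correlated items yield alpha = 1.0; uncorrelated items drive alpha towards 0.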

Acceptable tests usually have reliability coefficients between 0.7 and 1.0; correlations exceeding 0.9 are rarely achieved. For individual performance to be measured reliably, correlation coefficients of at least 0.75 and a Cronbach's alpha of at least 0.85 are recommended. These are minimum values; in the scientific literature, the suggested values are often higher.


Validity indicates whether a test is able to measure what it is intended to measure. For example, hit rate alone is not a valid measure of detection performance in terms of discriminability (or sensitivity), because a high hit rate can also be achieved by judging most bags as containing prohibited items. In order to measure detection performance in terms of discriminability (or sensitivity), the false alarm rate must be considered, too (for different detection measures see MacMillan & Creelman, 1991, and Hofer & Schwaninger, 2004).

As with reliability, there are different types of validity. The term face validity refers to whether a test appears to measure what it claims to measure. A test should reflect the relevant operational conditions. For example, a test for measuring X-ray image interpretation competency that contains X-ray images of bags and requires screeners to decide whether each depicted bag contains a prohibited item would be face valid. Concurrent validity refers to whether a test can distinguish between groups that it should be able to distinguish between (e.g. between trained and untrained screeners). In order to establish convergent validity, it has to be shown that measures that should be related are indeed related. If, for example, threat image projection (TIP, i.e. the insertion of fictional threat items into X-ray images of passenger bags) measures the same competencies as a computer-based offline test, one would expect a high correlation between TIP performance data and the computer-based test scores. Another validity measure is called predictive validity: the test's ability to predict something it should be able to predict. For example, a good test for pre-employment assessment would be able to predict on-the-job X-ray screening detection performance. Content validity refers to whether the content of a test is representative of the content of the relevant task. For example, a test for assessing whether screeners have acquired the competency to detect different threat items in X-ray images of passenger bags should contain X-ray images of bags with different categories of prohibited items according to an internationally accepted prohibited items list.

Standardization / developing population norms

The third important aspect for judging the quality of a test is standardization. This involves administering the test to a representative group of people in order to establish norms (a normative group). When an individual takes the test, it can then be determined how far above or below the average his or her score lies relative to the normative group. It is important to know how the normative group was selected, though. For instance, for the standardization of a test used to evaluate the detection performance of screeners, a meaningful normative group consists of a large and representative sample of screeners (at least 200 males and 200 females).
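As an illustration of how norms are used, the sketch below converts a raw test score into a percentile rank relative to the normative group (the normative sample here is hypothetical and far smaller than the recommended one):

```python
def percentile_rank(score, norm_scores):
    """Percentage of the normative group scoring at or below `score`."""
    at_or_below = sum(1 for s in norm_scores if s <= score)
    return 100 * at_or_below / len(norm_scores)

# Hypothetical normative sample of ten screener scores
norms = [55, 60, 62, 66, 70, 72, 75, 80, 85, 90]
rank = percentile_rank(72, norms)   # → 60.0: better than 60% of the group
```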

In summary, competency assessment of X-ray image interpretation needs to be based on tests that are reliable, valid and standardized. However, it is also important to consider test difficulty, particularly if results from different tests are compared to each other. Although two tests can have similar properties in terms of reliability, an easy test may not adequately assess the level of competency needed for the X-ray screening job.

Signal Detection Theory - d' and A'

Figure 2: Seven ROC curves, each of which corresponds to a different d' value (0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0). The higher the d' value, the higher the detection performance. For example, screener A has a detection performance of d' = 2.5, which represents a much better detection performance than screeners B and C with d' = 0.5. A2 is the same screener as A but with a more liberal response bias. The same holds for B2 and B.


Signal detection theory provides methods for calculating detection measures that are independent of subjective response biases and thereby provide valid indicators of threat detection performance. This makes it possible to identify screeners who can detect forbidden objects very well and at the same time are good at correctly identifying harmless bags. The curves in Figure 2 are called receiver operating characteristics, or simply ROC curves. They are a graphic description of how the hit rate of an observer changes as a function of changes in the false alarm rate. Each ROC curve corresponds to a different detection performance, which is indicated by the measure d' (or sensitivity). This measure is calculated by the formula d' = z(hit rate) − z(false alarm rate) and is related to the distance of an ROC curve from the diagonal. In the formula, z denotes the z-transformation, i.e. the hit rate and the false alarm rate are converted into z-scores (standard deviation units). For example, person B in Figure 2 has a high hit rate but also a high false alarm rate. Consequently, d' = z(hit rate) − z(false alarm rate) is relatively small and the person lies on an ROC curve with a low d' value, namely d' = 0.5. In other words, this person has a very low detection performance and achieved a high hit rate simply by judging most bags as being NOT OK. Security is achieved at the expense of efficiency. In contrast, what you are looking for in order to increase airport security performance is someone like person A. This screener has a high hit rate and a low false alarm rate. Security is achieved without sacrificing efficiency, which is reflected in a high d' value. As can be seen in Figure 2, person A lies on the ROC curve corresponding to d' = 2.5, indicating a much better detection performance than that of screener B. A very useful property of signal detection theory is that the detection measure d' is independent of subjective response biases.
This is very important, because response biases influence the hit rate and depend on a variety of factors such as the subjective probability of occurrence of certain threat objects, the expected costs and benefits of the response, personality, and job motivation. For example, the subjective probability that weapons, knives, and other forbidden objects could occur in cabin baggage increased immediately after September 11, 2001. Of course, detection performance (d') could not change from one day to the next; person A in Figure 2 remained a better screener than person B. What changed immediately was the response bias. Most screeners shifted their response bias towards responding more often with NOT OK, which is illustrated in Figure 2 by the changed positions A2 and B2. Note that the detection performance d' remained the same: both screeners remained on their own ROC curve. Subjective response biases also differ from one person to another. For example, screener C in Figure 2 has a more "conservative" response bias than person B, which results in a lower false alarm rate. But because the hit rate is also much smaller, screener C has the same low detection performance as screener B (note that in Figure 2 both screeners are located on the ROC curve corresponding to a relatively low d' value of 0.5). A second reason for the shift in response bias as a reaction to September 11, 2001 is that the subjective costs produced by long waiting lines suddenly seemed small to everyone, including passengers, when compared to the subjective costs of missing a threat object. Finally, hand searching bags after screening them is time consuming and can be stressful if passengers do not cooperate well. Therefore, the personality and job motivation of the screener are further factors that can influence subjective response biases.
Whereas response biases can change rapidly and can also be influenced by external factors such as screener incentive programs, an increase in true detection performance (d') is more difficult to achieve and requires training. Last but not least, it should be mentioned that signal detection theory is often used to measure the detection performance of machines. Simply imagine that the letters A, B, and C in Figure 2 were automatic explosive detection systems from different vendors. Because machine A has the highest detection performance (d'), you would invest in this technology, especially since the position on the ROC curve (the "response bias") can be changed by adjusting a detection threshold.
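The formula d' = z(hit rate) − z(false alarm rate) translates directly into code: the z-transformation is the inverse of the standard normal cumulative distribution function. A sketch in Python's standard library (the hit and false alarm rates are hypothetical and must lie strictly between 0 and 1):

```python
from statistics import NormalDist

def d_prime(hit_rate, false_alarm_rate):
    """Sensitivity d' = z(hit rate) - z(false alarm rate)."""
    z = NormalDist().inv_cdf            # inverse of the standard normal CDF
    return z(hit_rate) - z(false_alarm_rate)

# Both hypothetical screeners detect 90% of threats, but the second one
# also alarms on 75% of harmless bags, so the true sensitivity is far lower.
screener_a = d_prime(hit_rate=0.90, false_alarm_rate=0.10)   # ≈ 2.56
screener_b = d_prime(hit_rate=0.90, false_alarm_rate=0.75)   # ≈ 0.61
```

In practice, hit and false alarm rates of exactly 0 or 1 are first adjusted (e.g. with a log-linear correction), since the z-transformation is undefined there.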

Another well-known measure, which is "nonparametric" (or sometimes called "distribution-free"), is A', first proposed by Pollack and Norman (1964). The term "nonparametric" refers to the fact that the computation of A' requires no a priori assumptions about the underlying distributions. A' can be calculated when ROC curves are not available and the normality and equal-variance assumptions for the signal-plus-noise and noise distributions cannot be verified.

A' is calculated from the hit rate and the false alarm rate using the following formulas:

If the hit rate (H) is larger than the false alarm rate (F):

A' = 0.5 + (H − F)(1 + H − F) / [4H(1 − F)]

If the false alarm rate is larger than the hit rate:

A' = 0.5 − (F − H)(1 + F − H) / [4F(1 − H)]

The advantage of A' over d' is that it requires no a priori assumptions about the underlying noise and signal-plus-noise distributions. For further information on these and other detection measures, see Hofer and Schwaninger (2004).
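The two cases above combine into one short function. A sketch, assuming the input rates avoid the degenerate denominators (H = 0 in the first branch, F = 0 in the second):

```python
def a_prime(hit_rate, false_alarm_rate):
    """Nonparametric sensitivity A' (Pollack & Norman, 1964)."""
    h, f = hit_rate, false_alarm_rate
    if h >= f:
        return 0.5 + (h - f) * (1 + h - f) / (4 * h * (1 - f))
    return 0.5 - (f - h) * (1 + f - h) / (4 * f * (1 - h))
```

A' ranges from 0.5 (chance performance, H = F) to 1.0 (perfect discrimination), so a screener with H = 0.9 and F = 0.1 scores about 0.94.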


References

  • Green, D. M., & Swets, J. A. (1966). Signal detection theory and psychophysics. New York: Wiley.
  • Grier, J. B. (1971). Nonparametric indexes for sensitivity and bias: Computing formulas. Psychological Bulletin, 75, 424−429.
  • Hofer, F., & Schwaninger, A. (2004). Reliable and valid measures of threat detection performance in X-ray screening. IEEE ICCST Proceedings, 38, 303−308.
  • MacMillan, N. A., & Creelman, C. D. (1991). Detection theory: A user's guide. Cambridge: Cambridge University Press.
  • Pastore, R. E., Crawley, E. J., Berens, M. S., & Skelly, M. A. (2003). "Nonparametric" A' and other modern misconceptions about signal detection theory. Psychonomic Bulletin & Review, 10(3), 556−569.
  • Pollack, I., & Norman, D. A. (1964). A non-parametric analysis of recognition experiments. Psychonomic Science, 1, 125−126.