Determining Comparability of Bioassessment Methods and Their Results

Jerome Diamond, James Stribling

Tetra Tech, Inc., Owings Mills, MD

Chris Yoder

Ohio EPA, Columbus, OH



A "true" or accurate bioassessment is difficult to document, even for a specified time and place, because of the heterogeneous spatial and temporal distribution of the species present. Unlike chemical analytical assessments, in which method accuracy can be verified in a number of ways, the accuracy of biological (i.e., field) assessments cannot be objectively verified; we are unable, for example, to conduct meaningful "matrix spikes" for biological information in aquatic systems. Currently, a multitude of biological collection and data interpretation methods are used by different organizations in the U.S. The bioassessment information collected by these organizations is useful almost exclusively to the individual organization sponsoring the program. Monitoring groups outside the collecting agency typically find it difficult to know which bioassessment information they can use with confidence. The result is limited data sharing across organizations because the quality of "foreign" data is suspect or unknown. In some cases, different bioassessments yield conflicting interpretations at the same sites, underscoring the uncertain accuracy of bioassessment results.

The current use of many bioassessment methods, with little or no information as to the comparability of results obtained by these different methods, is a significant problem in three ways: (1) assessments of aquatic resources on broad geographic scales (basins for example) or from state to state are not easily feasible because different methods may be used in different parts of the region of interest, (2) opportunities for increased resource efficiency or for minimizing duplication of efforts are missed, and (3) depending on which bioassessment results are chosen, the quality of biological resources present, and/or trends in the status of those resources over time, may be misinterpreted.

Bioassessments generally consist of three major facets: field data collection (sampling gear and sampling protocol), data summarization and reduction (metric calculations, indices), and a decision or interpretive framework (ecoregional reference conditions, site-specific reference or control). Thus, comparison of interpretive results among bioassessment methods is underlain by many complex interactions involving the collection and analysis of data. We believe that the degree of data comparability between bioassessment methods can be traced to the type of data collection methods used and their performance characteristics. This paper presents a framework for characterizing and comparing bioassessment methods that uses a performance-based system. We discuss the advantages of using such a system and propose a framework for defining performance criteria and judging bioassessment comparability. Lotic benthic macroinvertebrate bioassessment methods are used as examples in this paper because the authors are most familiar with these methods and because benthic macroinvertebrates are probably the most widely used bioassessment indicator in state and federal water quality programs in the U.S. and in other countries, including Canada, Great Britain, and Australia.


Definition and Components of a Performance-Based Measurement System (PBMS) Approach

There are two general approaches for acquiring comparable bioassessment data. One way is to have every program use the same method. In the past, the U.S. Environmental Protection Agency (USEPA) and analogous agencies in other countries have attempted to pursue this option. The development of Rapid Bioassessment Protocols (RBPs) within the USEPA, for example, was in part an attempt to standardize bioassessment methods in the U.S. However, forcing all collecting organizations to use a single bioassessment method, no matter how exemplary, is probably not feasible because different regions or lotic habitats require different sampling methods and because it is not likely that the current establishment of different methods can (or should) be reversed.

An alternative approach to acquiring comparable data from different organizations, and one recommended by the Interagency Task Force on Water Quality Monitoring (USGS 1995) for all methods, is to develop a performance-based approach for characterizing methods, and for setting quantifiable, realistic performance criteria. In this approach, the data quality requirements for a particular bioassessment program are specified in advance and the data collecting entity can select the appropriate method to meet those specifications. This is termed a performance-based measurement system (PBMS). A PBMS is defined as a system that permits the use of any appropriate sampling and analysis method that demonstrates the ability to meet established data criteria and complies with specified data quality requirements or data quality objectives (DQOs). DQOs include requirements for method precision, bias, sensitivity, detection limit, and range of conditions or matrices over which the method yields satisfactory data. With the successful introduction of the PBMS concept in laboratory analytical chemistry testing, and more recently in laboratory toxicity testing (USEPA, 1994), it appears worthwhile to examine the possibility of transposing such a system to the problem of bioassessment method comparability.

For the PBMS approach to work, several basic requirements must be met (USGS 1995): data quality objectives must be set that realistically define and measure the quality of the data needed; reference (validated) methods must be made available that at least meet those data quality objectives; to be considered satisfactory, an alternative method must be as good as or better than the reference method in terms of its resulting data quality characteristics; there must be proof that the method yields reproducible results that are sensitive enough for the program or sponsor needs; and, finally, the method must be adequate over the prescribed range of conditions in which it is to be used. In a bioassessment context, these requirements imply that the quality of the data generated by a given bioassessment method (i.e., its precision, sensitivity to different levels or types of impairment, and range of habitats over which the collection method yields a specified data precision or sensitivity) is known and quantified (validated).

Table 1 summarizes some ways in which analytical chemistry methods define certain performance characteristics. As an example, we compare these performance demonstration techniques with those that have been used by different organizations to define performance characteristics for laboratory sorting and taxonomic identification of benthic macroinvertebrate samples. It is evident that many of the same method performance characteristics can be quantified for laboratory procedures used in bioassessments. Although such validation has been performed by a number of organizations and for certain bioassessment methods, rarely are the performance characteristics quantified for comparison purposes or to explicitly demonstrate to prospective users that the method actually meets program DQOs. For example, repeated examinations of a sorting and identification method by trained personnel could establish that the rate of missed organisms is less than 10% of the sample and that taxonomic identifications (to the genus level) have an accuracy rate of at least 90% (as determined by check samples identified by recognized experts). If such laboratory accuracy and completeness were believed to be necessary for a given study, the study sponsor could require the above data quality characteristics as DQOs. In this case, the above method meets the DQOs and could be considered the reference method. In a PBMS approach, any other laboratory method that documented attainment of these DQOs would yield data comparable to the reference method, and the results would therefore be satisfactory for the study.
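The laboratory example above can be sketched as a simple DQO check. This is only an illustration under stated assumptions: the `LabCheck` record, its field names, and all counts are hypothetical, and the two thresholds are the 10% missed-organism rate and 90% identification accuracy used in the example.

```python
from dataclasses import dataclass


@dataclass
class LabCheck:
    """Hypothetical QC record for one re-examined benthic sample."""
    organisms_picked: int   # organisms recovered by the original sorter
    organisms_missed: int   # extra organisms found when the residue was re-sorted
    ids_checked: int        # specimens re-identified by a recognized expert
    ids_confirmed: int      # specimens whose genus-level ID the expert confirmed


def missed_rate(check: LabCheck) -> float:
    """Fraction of all organisms in the sample that the sorter missed."""
    total = check.organisms_picked + check.organisms_missed
    return check.organisms_missed / total


def id_accuracy(check: LabCheck) -> float:
    """Fraction of checked identifications confirmed by the expert."""
    return check.ids_confirmed / check.ids_checked


def meets_dqos(checks, max_missed=0.10, min_accuracy=0.90) -> bool:
    """True if every check sample satisfies both DQOs from the example."""
    return all(
        missed_rate(c) < max_missed and id_accuracy(c) >= min_accuracy
        for c in checks
    )


# Invented check-sample results for two laboratory methods.
reference_method = [LabCheck(285, 15, 100, 94), LabCheck(190, 10, 80, 76)]
alternative_method = [LabCheck(170, 30, 100, 95)]

print(meets_dqos(reference_method))    # both samples fall within the DQOs
print(meets_dqos(alternative_method))  # missed rate of 0.15 exceeds the 10% DQO
```

Under a PBMS, the same check would be applied to any candidate method; only methods whose documented check samples pass would be treated as comparable to the reference method.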

The above example underscores the important issue of personnel training, which is central to most data collection methods and to bioassessment methods in particular. The performance of any method depends on having adequately trained people. One way to document satisfactory training is to quantify performance characteristics of the method using newly trained personnel and to compare these characteristics with those established previously and considered acceptable. While this is frequently done for new field crews and new laboratory personnel in many organizations, rarely are the results of such training documented or quantified. As a result, the organization cannot assure either itself or other potential data users that different personnel performing the same method yield comparable results and that the data quality specifications of the method are being consistently met.

To demonstrate the PBMS framework in a bioassessment context, precision is taken as an example of a performance characteristic. Method precision could pertain to many aspects or subprocedures used in biological assessments. A key factor unique to developing a PBMS framework for bioassessment methods is that bioassessments often consist of several subprocedures that are tightly linked (Figure 1). Thus, a comprehensive characterization of a complete bioassessment method may entail a definition of applicable performance characteristics for each sub-procedure. Precision with respect to sampling procedures, for example, could be determined by examining specific metrics at a given site using replicate samples taken from the site. Alternatively, precision of the interpretive bioassessment framework might be determined by examining specific metrics or assessment scores across supposed replicate reference sites in a given ecoregion and within a specific stream reach classification. Once data precision is quantified for different bioassessment methods, it is possible to: (1) derive an overall precision criterion, (2) designate a reference method that meets this criterion, and (3) assess the degree to which different methods yield comparable data precision. Other performance characteristics such as performance range, method interferences, and matrix applicability, also would be used to derive performance criteria and quantify bioassessment comparability. While some of this information is published for certain bioassessment methods, much of this knowledge is incorporated in an informal manner and not quantified within the framework of the method itself (e.g., Peckarsky 1984; Resh and Jackson 1993). This information needs to be more available to data users and organizations so that the quality of data obtained by different methods is documented and one can then judge whether results obtained from different methods are comparable.
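The two levels of precision described above (replicate samples within a site, and replicate reference sites within an ecoregion) can be quantified with the same statistic. The sketch below uses the coefficient of variation, which is one common choice and an assumption here rather than a measure prescribed by this framework; all data values are invented for illustration.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (lower = more precise)."""
    return statistics.stdev(values) / statistics.mean(values)


# Hypothetical taxa-richness values from four replicate samples at one site.
replicates_at_site = [22, 24, 23, 25]

# Hypothetical assessment scores, one per reference site in one ecoregion.
scores_across_reference_sites = [40, 42, 38, 44, 41]

# Precision of the sampling procedure (within-site replicates).
sampling_cv = coefficient_of_variation(replicates_at_site)

# Precision of the interpretive framework (among reference sites).
framework_cv = coefficient_of_variation(scores_across_reference_sites)

print(f"within-site (sampling) CV: {sampling_cv:.3f}")
print(f"among-reference-site CV:   {framework_cv:.3f}")
```

Because the coefficient of variation is unitless, it allows metrics or scores on different scales to be compared, which matters when methods use different scoring systems.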

In defining a reference method for a given bioassessment procedure, it is imperative that the specific range of environmental conditions be quantitatively defined. In lotic benthic macroinvertebrate bioassessment methods, the performance range or applicable environmental conditions for the method is usually addressed qualitatively by including factors such as stream size, hydrogeomorphic reach classification, and general habitat features (riffle vs. pool, shallow vs. deep water, rocky vs. silt substrate). In a PBMS framework, different methods could be classified based on their ability to achieve specified levels of performance characteristics, such as precision and sensitivity to impairment, over a range of appropriate habitats. In this way, the performance range of bioassessment methods can be directly and quantitatively compared.


Advantages of a PBMS Approach for Characterizing Bioassessment Methods

In performing a benthic macroinvertebrate assessment, two fundamental concerns are of interest: that the sample taken and analyzed is representative of the site or the population of interest, and that the data obtained are an accurate reflection of the sample collected and analyzed. The first concern is addressed through appropriate field sampling procedures, including site selection, sampling device, and sample preservation methods. These sampling methods will be dictated, to a certain extent, by the desired DQOs. The second concern is addressed by using appropriate laboratory or analysis procedures. In a PBMS framework, the appropriate question is what minimum precision (variance) and "accuracy" (value of the metric or score compared to the "true" or unchanging value, given the bioassessment method) are required for a particular program or study. One could, for example, decide that the variance and completeness generated on average by four Surber samples at a site are acceptable because the increase in data "accuracy" and precision from further sampling is not enough to warrant the increased effort and cost of obtaining those data; i.e., the bioassessment interpretation will not be changed substantially by the increased effort.
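The diminishing return from additional replicate samples can be illustrated with the standard error of a site mean, which shrinks as the square root of the number of samples. This sketch assumes independent replicates and a fixed among-sample standard deviation (the value of 6.0 is hypothetical); it addresses precision only, not the completeness of the taxa list.

```python
import math


def standard_error(sample_sd: float, n_samples: int) -> float:
    """Standard error of the site mean from n independent replicate samples."""
    return sample_sd / math.sqrt(n_samples)


def marginal_gain(sample_sd: float, n_samples: int) -> float:
    """Reduction in standard error obtained by adding one more sample."""
    return standard_error(sample_sd, n_samples) - standard_error(sample_sd, n_samples + 1)


ASSUMED_SD = 6.0  # hypothetical among-sample SD of a metric at one site

# Columns: number of samples, standard error, gain from one additional sample.
for n in range(1, 9):
    print(n, round(standard_error(ASSUMED_SD, n), 2), round(marginal_gain(ASSUMED_SD, n), 2))
```

A program could compare the marginal gain at each sample count with the cost of one more sample to decide where further effort no longer changes the interpretation.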

Using a PBMS framework, the question is not which method is more "accurate" or precise but rather what accuracy and precision level a method can consistently achieve, and whether those performance characteristics meet the DQOs of the program such that bioassessment interpretations can be justified. Furthermore, once data precision and "accuracy" are quantified for a bioassessment method, error rates can be estimated so as to determine whether the method will meet DQOs for a particular study or program. The method may then be modified (e.g., by taking more replicate samples or larger samples) to improve its precision and "accuracy", reduce Type I and Type II error rates, and therefore meet the more stringent DQOs of other programs or studies.

In benthic macroinvertebrate collection methods, many measures or metrics are potentially determined for the same sample. Together, these measures may form an index or score (IBI, ICI, references) or alternatively, a multivariate analysis and, eventually, a narrative rating of status (Figure 2). Method comparability could be determined if one knew how a particular metric of interest or assessment score behaves under different environmental conditions (impaired vs. reference sites, different habitat types, different seasons, or index periods) for each method. Such information (obtained through repeated sampling at different times in the same location and sampling in different habitats and locations at the same time) would yield estimates of method bias, precision, interferences, and performance range.

The objectives of the data users will define which measure(s) and what environmental conditions should be used to determine comparability among methods. DQOs also will dictate how similar certain performance parameters need to be for the data obtained from two different methods to be considered comparable. It is quite possible that two methods may be very comparable for certain measures of interest and not others. Knowing this, one could use data for those measures where different methods are comparable. Alternatively, two methods that differ only in their sample processing procedures, for example, can be relatively easily compared over a broad range of field sampling conditions by knowing the performance characteristics of the other procedures for either method. The key is that performance characteristics are defined for each method and that the data user has access to comparability information when reviewing the data or deciding whether to use data collected by another method.

The PBMS framework is especially useful for comparing bioassessment methods having different collection methods and different metrics or indices. An example is Ohio EPA’s comparison of two methods: one using Hester-Dendy (artificial substrate colonization) collection and a rigorous empirical classification/interpretation framework, and a volunteer monitoring method based on kick-net sampling. These methods examined different taxa and developed different measurements or metrics. Comparison of results using the two methods at the same sites showed that the most informative performance characteristics for comparison were sensitivity or discriminatory power among sites and consistency or reproducibility among results within a site. The Hester-Dendy method showed greater sensitivity and reproducibility than the kick-net method and was therefore judged to be the more appropriate method for the Agency’s needs. However, the two assessment methods yielded comparable results for sites that were significantly impaired. Thus, using a PBMS approach, Ohio EPA could rely on the results of the volunteer monitoring method for sites that were judged as impaired.

Suggested Approach for Defining Performance Characteristics for Bioassessment Methods

Bioassessments, regardless of the method, determine test site condition on the basis of some reference condition or reference sites. The bioassessment reference condition consists of carefully chosen sites or conditions that meet certain a priori criteria and is specific for a certain environmental stratum or regime (ecoregion, habitat, season).

An important first step towards defining performance characteristics for bioassessment methods is to examine the data collected by the method for a given reference condition. This has been done, in part, by several States, some USEPA programs, and the U.S. Geological Survey NAWQA Program. Within a given ecoregion, several reference sites that have appropriate habitat for the sampling gear are sampled using a prescribed method. In some cases, sites are sampled more than once in a year so that a measure of temporal precision can be obtained for each metric and for the assessment score as a whole. Measures for all reference sites within a given region are then compiled to derive the reference condition characteristics for that region. If this approach is used in different ecoregions, one can quantify several important performance characteristics: precision for a given metric or assessment score across replicate reference sites within an ecoregion; temporal precision for a given metric or score for reference conditions within an ecoregion; bias of a given metric and (or) method owing to differences in ecoregions or habitats; performance range of a given method across different ecoregions; potential interferences to a given method that are related to ecoregional or habitat qualities; and relative precision of a given metric or score among reference sites in different ecoregions.

While sampling and evaluating reference sites is necessary to characterize bioassessment performance, it is not sufficient. We also need to know how a given method performs over a range of impaired conditions, that is, the method’s sensitivity to impairment. As discussed earlier in this paper, sites do not have known levels of impairment or analogous standards from which to create a calibration curve for a given bioassessment method. Given this limitation, sampling sites are chosen that have known stressors (e.g., urban runoff, metals, grazing, sediments, pesticides). Because different sites may or may not have the same level of impairment within a region (i.e., are not replicates), precision of a method at impaired sites is examined by taking and analyzing multiple samples from the same site.

Table 2 illustrates the process by which a bioassessment method would quantify the necessary performance characteristics using reference-condition and test-site data. Two different ecoregions or habitat types are assumed in this process. More habitats or ecoregions would improve determination of the performance range and biases for a given method. Five reference sites are assumed for each ecoregion. This is a compromise between the effort and cost required on the one hand, and the resultant statistical power gained on the other. More reference sites would further refine method precision, performance range, and possibly discriminatory power of the method. At least three reference sites in a given region should be considered a minimum to evaluate method precision. Given the usually wide variation of natural geomorphic conditions and landscape ecology, even within supposedly "uniform" ecoregions, it is desirable to examine 10 or more reference sites in a region (Barbour et al. 1996).

A range of impaired sites within a region is suggested to sufficiently characterize a given method. It is important that impaired sites meet the following criteria: they are very similar in habitat and geomorphometry to the reference sites examined; they are clearly receiving some chemical, physical, or biological stressor(s) and have been for some time (months at least); and impairment is not obvious without sampling, that is, the sites should not be heavily impaired. Widely different assessment procedures typically yield the same interpretation at heavily impaired sites, so such sites reveal little about differences among methods. A much better test of method sensitivity, as well as its performance range, is to examine sites with some, but not severe, stressors present. Ideally, several (three) test sites are examined in different regions, each with different stressors present and (or) different levels of the same stressor. Such a sampling design would enable one to derive more precise estimates of the performance range and any biases of the method or its assessment scoring system due to the type of stressor or ecoregional characteristics.

Once performance characteristics are defined for each method, performance criteria can be established, as well as scientifically feasible DQOs. If one method, for example, yields greater variability (less precision) in the same measure or in assessment scores among reference sites within an ecoregion than another method, then the precision exhibited by the less variable method may be used to define a performance criterion for precision. A program or study can then require a method that meets that precision criterion, and the collecting agency can select an appropriate method with confidence. Another collection method that demonstrates precision similar to or better than the criterion demonstrated by the reference method is comparable, and data from the two methods can be used together with confidence.
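The selection of a precision criterion described above can be sketched as follows. The method names, the coefficient-of-variation values, and the `tolerance` factor (how much worse than the criterion still counts as "similar") are all hypothetical; a real program would set the tolerance from its own DQOs.

```python
def precision_criterion(cv_by_method):
    """Pick the least variable method as the reference; its CV is the criterion."""
    reference = min(cv_by_method, key=cv_by_method.get)
    return reference, cv_by_method[reference]


def is_comparable(candidate_cv, criterion, tolerance=1.25):
    """True if the candidate's CV is no worse than criterion * tolerance."""
    return candidate_cv <= criterion * tolerance


# Hypothetical among-reference-site CVs of the assessment score in one ecoregion.
cv_by_method = {"method_1": 0.055, "method_2": 0.125, "method_3": 0.060}

reference, criterion = precision_criterion(cv_by_method)
comparable = {m: is_comparable(cv, criterion) for m, cv in cv_by_method.items()}

print(reference, criterion)
print(comparable)
```

With these invented values, the least variable method sets the criterion, and only methods whose variability falls within the chosen tolerance of it would be pooled with the reference method.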

In determining whether two collection methods give comparable results, note that method comparability is based, for the most part, on the relative magnitude of the variances in measurements within and between ecoregions. We explicitly are not basing comparability on the measurements themselves because different methods may have different metrics or scoring systems. In addition, some sampling methods may explicitly ignore certain taxonomic groups and metrics that other methods include. However, if one is especially interested in comparing the same metric among different methods, this can be easily incorporated into the test design in Table 2 by comparing mean values for regional reference or test sites using a paired t-test or non-parametric equivalent.
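A paired t-test of the same metric measured by two methods at the same sites can be sketched with the standard library alone. The site values below are invented, and the function returns only the test statistic and degrees of freedom; the statistic would then be compared with a tabulated critical value (2.776 for 4 degrees of freedom at a two-sided 0.05 level).

```python
import math
import statistics


def paired_t_statistic(values_a, values_b):
    """Return (t, degrees_of_freedom) for paired observations.

    Each position in the two lists must refer to the same site.
    Raises ZeroDivisionError if all paired differences are identical.
    """
    if len(values_a) != len(values_b):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(values_a, values_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(n)
    return mean_diff / std_err, n - 1


# Hypothetical values of one metric at five reference sites, by method.
method_1 = [20, 22, 19, 24, 21]
method_2 = [18, 21, 17, 21, 20]

t_stat, dof = paired_t_statistic(method_1, method_2)
print(f"t = {t_stat:.3f} with {dof} degrees of freedom")
```

Pairing by site removes among-site variation from the comparison, so the test addresses only the systematic difference between the two methods.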

Relative accuracy of each method is addressed to the extent that the test sites chosen are likely to be truly impaired on the basis of independent factors such as the presence of chemical stressors or suboptimal habitat features. Low data precision (high score variability) among ecoregional reference sites for one method compared with another suggests either uncertain method accuracy or poor selection of reference sites. For some program goals and DQOs, some certainty in the results may be sacrificed if the method has other advantages, such as reduced costs and less effort to perform.

Another performance characteristic of interest to those using benthic macroinvertebrate bioassessments is the sensitivity or discriminatory power of the method; that is, how well does a given method detect marginally or moderately impaired sites? Actual mean scores or metric values are used to determine method sensitivity in the form of a ratio between the impaired-site value and the regional reference value. Because impairment can only be judged relative to a reference or attainable biological condition in the relative absence of stressors, the score or metric at the impaired test site is not an absolute value and must be related to the appropriate reference-condition value. A method that yields a larger ratio of test-site score to reference score (m/x̄1, p/x̄2, c/a1, or q/a2; Table 2) would indicate less discriminatory power or sensitivity; that is, the test site is perceived to be similar to or better than the reference condition and, therefore, not impaired. If, however, the intent is to screen many sites so as to prioritize "hot" spots or significant impairment problems in need of corrective management action, then a method that is inexpensive and quick and tends to show impairment when significant impairment is actually present (such as some volunteer monitoring methods) can meet prescribed DQOs with relatively little cost or effort. In this case, the DQOs dictate a low priority for discriminatory power (high Type I error rate) and a high priority for accuracy in the decision (low Type II error rate); that is, a truly impaired site has a high probability of being categorized as such.
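The sensitivity ratio described above can be computed directly from a test-site value and the regional reference values. The sketch below assumes a scoring system in which higher scores indicate better condition, so a ratio well below 1.0 signals detected impairment; all scores are hypothetical.

```python
import statistics


def discrimination_ratio(test_site_score, reference_scores):
    """Ratio of the test-site score to the mean regional reference score."""
    return test_site_score / statistics.mean(reference_scores)


# Hypothetical scores at five reference sites and one moderately impaired site,
# each method using its own scoring scale.
method_1_reference = [40, 42, 38, 44, 41]
method_1_test_site = 29

method_2_reference = [30, 34, 28, 36, 32]
method_2_test_site = 30

ratio_1 = discrimination_ratio(method_1_test_site, method_1_reference)
ratio_2 = discrimination_ratio(method_2_test_site, method_2_reference)

print(f"method 1 ratio: {ratio_1:.2f}")
print(f"method 2 ratio: {ratio_2:.2f}")

# Under the higher-is-better assumption, the smaller ratio shows more separation
# between the impaired site and the reference condition.
more_sensitive = "method 1" if ratio_1 < ratio_2 else "method 2"
print("greater discriminatory power:", more_sensitive)
```

Expressing sensitivity as a ratio to each method's own reference value is what allows methods with different metrics or scoring scales to be compared at the same site.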

Applicable performance range and bias are two other important performance parameters that relate directly to the overall utility of a given method and its comparability to other methods. The suggested framework (Table 2) defines these two performance characteristics by sampling in different ecoregions that have different physical habitat characteristics. Finding that a bioassessment method shows higher precision among reference sites in one ecoregion or hydrogeomorphic basin than in another ecoregion or basin type may be useful for deciding where or when a given method should or should not be used. Similarly, a metric or score that exhibits a consistent bias related to certain measured habitat features would help a user decide the types of sampling situations in which a particular method may be appropriate.



Acknowledgments

The ideas presented here represent a distillation of many discussions with members of the Intergovernmental Task Force on Monitoring (ITFM). We especially acknowledge the comments of Herb Brass (USEPA), Russ Sherer (SCDHEC), and Jeroen Gerritsen (Tetra Tech, Inc.). The authors dedicate this work to Russ Sherer, former co-chair of the Methods and Data Comparability Task Group under the ITFM, who died on February 6, 1996.



References

American Society for Testing and Materials. 1993. Biological effects, 11.04 of Annual book of standards: American Society for Testing and Materials. 1598 p.

Peckarsky, B. 1984. Sampling the stream benthos, in Downing, J., and Rigler, F., eds. A manual on methods for the assessment of secondary productivity in fresh waters (2d ed.): Oxford, Blackwell Scientific Publications, IBP Handbook 19. 501 p.

U.S. Environmental Protection Agency. 1989. Short-term methods for estimating the chronic toxicity of effluents and receiving waters to freshwater organisms (2d ed.): Cincinnati, Ohio. U.S. Environmental Protection Agency, Office of Research and Development. EPA/600-4-89-001. 334 p.

U.S. Environmental Protection Agency. 1990. Methods for measuring the acute toxicity of effluents and receiving waters to aquatic organisms (4th ed.): Cincinnati, Ohio. U.S. Environmental Protection Agency, Office of Research and Development. EPA/600-4-90-027. 293 p.

U.S. Geological Survey. 1993. Methods for sampling fish communities as a part of the National Water-Quality Assessment Program. Report 93-104. Raleigh, NC. U.S. Geological Survey. 40 p.




Table 1. Progression of a Generic Bioassessment Field and Laboratory Method and Corresponding Steps Requiring Performance Criteria Characterization

Step
    Examples of Performance Criteria

Sampling device
    Performance range: efficiency in different habitat types or substrates
    Bias: exclusion of certain taxa (mesh size)
    Interferences: matrix or physical limitations (current velocity, depth)

Sampling method
    Performance range: limitations in certain habitats or benthic substrates
    Bias: sampler (personnel) efficiency
    Precision: of metrics or measures among replicate samples at a site

Field sample processing (subsampling, transfer, preservation)
    Precision: of measures among splits of subsamples
    Accuracy: of transfer process
    Performance range: of preservation and holding time

Laboratory sample processing (sieving, sorting)
    Precision: among split samples
    Accuracy: of sorting method; equipment used
    Performance range: of sorting method depending on sample matrix (detritus, mud)
    Bias: in sorting certain taxonomic groups or organism sizes

Taxonomic enumeration
    Precision: split samples
    Accuracy: of identification and counts
    Performance range: dependent on taxonomic group and (or) density
    Bias: counts and identifications for certain taxonomic groups


Table 2. Recommended Process for Documentation of Performance Parameters and Comparability of Two Different Bioassessment Methods

[Five reference sites are assumed in this layout, but one could have a minimum of three sites for each region]

                    Region 1                                             Region 2
                    Reference sites 1-5       Impaired or test site      Reference sites 1-5       Impaired or test site
                    Method 1     Method 2     Method 1     Method 2      Method 1     Method 2     Method 1     Method 2
                    mean  var.   mean  var.                              mean  var.   mean  var.
Metric 1..n         x̄1    s1     x̄2    s2     m            p             a1    d1     a2    d2     c            q
Assessment score    X1    q1     X2    q2                                b1    f1     b2    f2


The following comparisons refer to the parameters specified above and are designed to yield various performance characteristics of a biological field-collection method.

Compare s1 with s2 for a given metric to determine relative precision of the metric for the two methods and an unimpaired condition.

Compare s1 with d1 and s2 with d2 to determine how metric variability may change with region. A relatively high variability in a given metric within a region, or compared with another region for the same method, would suggest a certain performance range and bias for the metric.

Compare m/x̄1 with p/x̄2 to determine the discriminatory power of a given metric for the two methods in region 1. A ratio close to 1.0 would signify little difference in the metric between an impaired site and the reference condition in region 1 for that method. The utility of the metric would be questionable in this case. Do the same type of analysis by comparing c/a1 and q/a2 for region 2.

Compare m/x̄1 with c/a1 and p/x̄2 with q/a2 to determine relative discriminatory power, performance range, and biases of a given metric and sampling method across regions. A similar ratio across regions for a given metric may indicate the robustness of the method and the metric. A ratio near 1.0 in one region and not in another for a given method and metric would indicate possible utility limitations or a limited performance range for that metric.

Compare q1 with q2 and f1 with f2 to determine overall method variability at unimpaired sites in each region. High variability in the score for one method compared to another method in a given region would suggest lack of comparability and (or) different applicable data-quality objectives for the two methods.

Compare q1 with f1 and q2 with f2 to determine relative variability in assessment scores in the two regions. A consistently low score variability for a given method across regional reference sites would suggest method rigor and potential sensitivity.

Compare resultant scores for a given method and region deleting apparently variable or insensitive metrics to determine metric redundancy and to determine relative discriminatory power at impaired sites.

Individual assessment scores for reference sites and impaired sites within each region can be compared between methods by using regression to determine if there is a systematic relation in scores between the two methods.
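The regression comparison of scores between two methods can be sketched with an ordinary least-squares fit. This is a minimal illustration written without external libraries; the paired site scores are hypothetical, and ordinary least squares is an assumption here, since the appropriate regression model is not specified above.

```python
import math


def linear_fit(x, y):
    """Ordinary least-squares fit of y on x.

    Returns (slope, intercept, r), where r is the Pearson correlation.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r


# Hypothetical assessment scores at the same sites (reference and impaired),
# one list per method; position i refers to the same site in both lists.
method_1_scores = [41, 44, 38, 29, 25, 42]
method_2_scores = [32, 35, 30, 26, 22, 33]

slope, intercept, r = linear_fit(method_1_scores, method_2_scores)
print(f"method 2 score = {slope:.2f} * method 1 score + {intercept:.2f} (r = {r:.2f})")
```

A strong, consistent relation would suggest that scores from one method can be translated to the other's scale; a weak relation would indicate that the two methods are responding to different aspects of site condition.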