TECHNICAL APPENDIX O

PERFORMANCE-BASED METHODS SYSTEM FOR BIOLOGICAL COLLECTION METHODS--A FRAMEWORK FOR EXAMINING METHOD COMPARABILITY


Relations of Analytical Performance-Based Characteristics to Biological Systems

Historically, chemical analytical data have been considered to be more quantitative than ecological or toxicological data, and correspondingly greater emphasis has been placed on such quality-control aspects as precision and bias. Recently, many biological methods have been refined and standardized such that truly quantitative data are obtained, along with certain quality-control characteristics. The two fields, however, may be fundamentally different in that an objective statement of method accuracy (defined below), which is usually available in chemical laboratory methods, may not be available for biological field methods; that is, although a given analytical method can be tested to see if it accurately measures the amount of an analyte (by means of spiking into clean water, for example), there are no such external standards by which to judge the accuracy of a given biological collection method or a given toxicological method. Scientists cannot presently devise a treatment or sample with a known toxicity value (independent of the method used) or spike a water sample with an absolute level of toxicity. Similarly, we may not be able to devise a site with a known level of impairment (independent of the method used) or "spike" a system with a known level of impairment. Instead, biological testing and collection methods have often relied on deciding, a priori, that a particular method (that is, the reference method) yielded "accurate" results with which results of other methods were compared.

With the introduction of the concept of performance-based methods systems (PBMS) in laboratory testing, particularly for chemical analytical data, it is apparent that a similar framework may be useful for examining comparability of field and laboratory biological data-collection methods. For example, in evaluating sediment or solid-phase toxicity, the American Society for Testing and Materials (1993) and the U.S. Environmental Protection Agency (1990; written commun., 1994) have developed biological toxicity test methods that have certain known performance criteria. They are currently recommending a PBMS approach to evaluate such toxicity; modifications of the recommended procedures are acceptable if it is shown that the performance criteria, as set by the recommended reference procedure, are met. In this case, method comparability is achieved by meeting specific performance criteria, such as negative control organism survival, growth of control organisms, and test endpoint precision, that have been established for a "reference method" developed under a specific regulatory program (USEPA TSCA, FIFRA, NPDES). Thus, the concept of PBMS is used in some aspects of biological laboratory testing.

Components of the Performance-Based Methods System Approach

Several performance parameters must be characterized for a given method to utilize a PBMS approach. These parameters include method precision, bias, performance range, interferences, and matrix applicability. These parameters, as well as method accuracy, are typically demonstrated in analytical chemistry systems through the use of blanks, standards, spikes, blind samples, performance evaluation samples, and other techniques to compare different methods and eventually to derive a reference method for a given analyte. Many of these performance parameters are applicable to biological laboratory and field methods and other prelaboratory procedures as well. It is known that a given collection method is not equally accurate over all ecological conditions even within a general aquatic system classification (streams, lakes, estuaries). Therefore, assuming a given method is a "reference method" on the basis of regulatory or programmatic reasons does not allow for possible translation or sharing of data derived from different methods because the performance characteristics of different methods have not been quantified. Furthermore, most biological methods have not had adequate analysis to provide a "crosswalk" to allow interpretation of results between different protocols. The following section draws parallels between aspects of PBMS developed for laboratory analytical chemistry methods and biological laboratory methods. The subsequent section discusses biological field methods.

Several conceptual similarities exist between chemical and biological laboratory methods with respect to quality-assurance (QA) concepts and method-comparability issues (table 1). In this section, many significant parallels are drawn between analytical and biological laboratory methods within the context of PBMS. Several performance parameters essential to a PBMS framework will be considered below.


Table 1. Translation of some performance criteria, derived for laboratory analytical
systems, to biological laboratory systems

Performance criteria     Analytical chemistry methods         Biological methods

Precision                Duplicate and replicate samples.     Multiple taxonomic identifications of
                                                              one sample; split sample for sorting,
                                                              identification, enumeration; multiple
                                                              subsamples.

Bias                     Spiked samples; standard reference   Taxonomic reference samples; "spiked"
                         materials; performance evaluation    organism samples.
                         samples.

Performance range        Standard reference materials at      Efficiency of sorting procedures under
                         various concentrations; evaluation   different sample conditions.
                         of spiked samples by using
                         different matrices.

Interferences            Knowledge of chemical reactions      Detrital material, mud in sorting
                         involved in procedure; spiked        animals; identification of young life
                         samples; procedural blanks.          stages; taxonomic uncertainty.

Method detection limit   Standards, instrument calibration.   Organism-spiked samples; level of
                                                              identification.


Table 2. Examples of laboratory quality-assurance design requirements
for reduction of probability of error

Protocol component       Design requirement

Subsampling              Proper equipment.
                         Training.
                         Standard operating procedures.
                         Proper laboratory facilities.
                         Proper oversight supervision.

Taxonomy                 Proper training.
                         Up-to-date literature.
                         Adequate dissecting microscope.
                         Adequate compound microscope.
                         Reference collection.
                         Voucher collection.
                         Predetermined taxon-specific level of identification.
                         Proper oversight supervision (by a skilled scientist).

Precision

Laboratory chemistry systems measure method precision through the use of replicate sample measurements over a range of analyte concentrations. High replicability or reproducibility of a given sample measurement indicates high method precision. High method precision is clearly an important criterion for any method because this ensures reproducible results and increases statistical power of inference testing in intersample comparisons. Discrimination among samples is more likely with a method that has high precision.
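As an illustration of how replicate measurements translate into a precision estimate, the following minimal Python sketch computes two commonly used summary statistics: the coefficient of variation of a set of replicates and the relative percent difference of a duplicate pair. The function names and the replicate values are hypothetical and are not taken from any method cited in this appendix.

```python
from statistics import mean, stdev


def coefficient_of_variation(replicates):
    """Return the sample standard deviation as a percentage of the mean."""
    return 100.0 * stdev(replicates) / mean(replicates)


def relative_percent_difference(a, b):
    """Return the absolute difference of a duplicate pair as a percentage of the pair mean."""
    return 100.0 * abs(a - b) / ((a + b) / 2.0)


# Hypothetical replicate measurements of a single sample (arbitrary units).
replicates = [10.0, 12.0, 11.0, 9.0, 13.0]

cv = coefficient_of_variation(replicates)
rpd = relative_percent_difference(10.0, 12.0)

print(f"Coefficient of variation: {cv:.1f} percent")
print(f"Relative percent difference of duplicates: {rpd:.1f} percent")
```

A smaller value of either statistic indicates higher precision. Which statistic is appropriate, and what value is acceptable, would depend on the program and its data-quality objectives.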

Precision is an important performance parameter for biological aquatic toxicity testing as well. As in laboratory chemical testing, precision is measured by examining replicate measures of a given biological endpoint (for example, number surviving, growth, number of offspring produced) in which certain reference materials (sodium chloride, copper sulfate, cadmium chloride, sodium pentachlorophenol) are used. In chemical testing, precision is increased by modifying the instrumentation or reagents of the method and through the use of calibration methods. An analogous procedure is used to increase the precision of a method in toxicity testing. Method modifications used for this purpose include the development of a more consistent, reliable food source in chronic toxicity testing (such as in the 7-day Ceriodaphnia survival and reproduction test); development of a standard dilution or control laboratory water (U.S. Environmental Protection Agency, 1990); and improved organism culturing techniques to ensure adequate organism health and consistent genetic composition within a given test (U.S. Environmental Protection Agency, 1989). A method that has lower test precision than a published or programmatic method that uses the same species and endpoint (defined as the reference method by the given program) is generally regarded as less useful, although other criteria may come into play [U.S. Environmental Protection Agency, 1990; J. Diamond, T. Abramson, and D. Reish, Tetra Tech, Inc., written commun. (ASTM E-47.01), 1994].

Laboratory methods for processing of biological field samples and capturing raw data also are concerned with method precision. For example, laboratory operations have distinct components that can have associated quality assurance program activities (table 2). Two component laboratory procedures for benthic macroinvertebrate sampling programs include subsampling and taxonomy. Subsampling is performed with preserved samples in the laboratory in this example. QA-design requirements do not differ between performing subsampling in the field and the laboratory, although adverse weather conditions could interfere with field-subsampling methods. Table 2 presents QA-design requirements for laboratory taxonomy to the genus or species level, although lower level taxonomy (that is, family) can be performed in the field by an experienced taxonomist.


Table 3. Examples of laboratory quality component routines that can
be used for benthic macroinvertebrate samples

[do., ditto]

Protocol component    Data quality component    Characterization

Subsampling           Precision                 Compare metric values between split samples
                                                and (or) replications.

Taxonomy              do.                       Multiple identifications by different taxonomists
                                                on single, randomly selected sample.

                      Accuracy                  Achieved by expert verification or comparison
                                                with reference collection.

Subsampling           do.                       Recheck of sample residue for missed specimens.

                      Bias                      Randomly selected grid squares; specimens removed
                                                to end of grid.

Precision, accuracy, and bias are characterized in biological laboratory analyses of field-collected samples through a variety of mechanisms (table 3). Not unlike chemical laboratory methods, biological methods rely on replicate measures to characterize precision and accuracy. Although method precision is recognized as a basic requirement of biological collection methods, few laboratory methods have actually documented precision or accuracy estimates.

Bias

The degree to which there is bias in a given laboratory analytical method is defined through the use of spiked or fortified samples, standard reference materials, and performance-evaluation samples. A similar process is utilized to detect bias in biological toxicity testing. For example, reference-toxicant and blind performance-evaluation samples are routinely used to detect possible bias or procedural problems with a given test method and biological endpoint (U.S. Environmental Protection Agency, 1990). However, unlike analytical chemistry testing, the biological toxicity test result is compared with a range of "normal" values generated by multiple laboratories that used quality-control charts and repeated testing over an extended time period. The "true," or theoretical, value for a given method and toxicant is determined by a consensus of different laboratories that perform the test and is not a truly independent standard as it is in analytical testing. Thus, method bias in toxicity testing is a relative criterion. For example, samples that have low toxicity when the Daphnia magna acute toxicity test method and survival as the endpoint are used (U.S. Environmental Protection Agency, 1990) show greater intralaboratory and interlaboratory variability and bias than samples that contain a higher toxicant concentration. The USEPA has used a similar QA program as part of their discharge monitoring report (DMR) studies. In this case, method bias is related to the consensus of participating laboratories and varies somewhat over the range of toxicity present. Method bias also may be related to the type of toxicant (for example, copper sulfate as compared with sodium chloride), although this has not been quantified at this time.
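The comparison of a test result with a range of "normal" values can be sketched as a simple control-chart check. The example below is only an illustration under stated assumptions: the limits are set at the mean plus or minus two standard deviations of past reference-toxicant results (a common convention, assumed here rather than taken from the cited methods), and the function names and LC50 values are hypothetical.

```python
from statistics import mean, stdev


def control_limits(historical_results, k=2.0):
    """Return (lower, upper) limits as mean +/- k sample standard deviations.

    k = 2 is an assumed convention, not a value prescribed by this appendix.
    """
    centre = mean(historical_results)
    spread = stdev(historical_results)
    return centre - k * spread, centre + k * spread


def within_limits(value, limits):
    """Return True when a new result falls inside the control limits."""
    lower, upper = limits
    return lower <= value <= upper


# Hypothetical reference-toxicant LC50 values from repeated tests (mg/L).
historical_lc50 = [4.0, 5.0, 6.0, 5.0, 4.0, 6.0, 5.0, 5.0]
limits = control_limits(historical_lc50)

for new_result in (5.8, 7.0):
    status = "within" if within_limits(new_result, limits) else "outside"
    print(f"LC50 of {new_result} mg/L is {status} the control limits")
```

Because the limits are derived from the accumulated results themselves, a result inside the limits shows consistency with past performance rather than agreement with an independent standard, which is the sense in which bias is a relative criterion here.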

Bias in laboratory processing of field-collected samples has been assessed by using techniques similar to chemical and toxicological testing. This is a performance criterion that has received increasing consideration in biological laboratory QA procedures (U.S. Environmental Protection Agency, written commun., 1994). For example, taxonomic and enumeration bias of plankton or macroinvertebrate samples can be determined by "spiking" blind samples with organisms of known identification and then submitting them to the routine sample-processing procedure. Similarly, performance-evaluation samples could be derived that contain known taxonomic composition and are processed along with actual field samples. Several types of laboratory procedures can be evaluated in this way. Positively identified macroinvertebrates can be added to a synthetic sample that has water, detritus, and no macroinvertebrates to evaluate bias in sorting, as well as taxonomic identification procedures. Alternatively, after sorting macroinvertebrates, the sample residue can be resorted to quantify the number and types of organisms missed or underestimated in typical sorting procedures. Clearly, the above procedures are applicable only for samples that are brought back to the laboratory for processing. Data that are collected in the field only, such as many fish identifications/enumerations, habitat information, or certain physicochemical measurements, require similar performance-parameter characterization but need to be handled differently. Biological field methods of this type are covered later in this technical appendix. Field methods, in general, are treated in Technical Appendix N. Further documentation of bias is needed for many biological methods to evaluate method comparability adequately.
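The spiking and residue-recheck procedures described above both reduce to simple recovery calculations. The following Python sketch shows one possible way to express them; the function names, taxa, and counts are hypothetical and serve only to illustrate the arithmetic.

```python
def percent_recovery(recovered, spiked):
    """Return the percentage of spiked organisms reported by routine processing."""
    if spiked <= 0:
        raise ValueError("number of spiked organisms must be positive")
    return 100.0 * recovered / spiked


def recovery_by_taxon(spiked_counts, recovered_counts):
    """Return percent recovery for each spiked taxon (a missing taxon counts as zero)."""
    return {
        taxon: percent_recovery(recovered_counts.get(taxon, 0), count)
        for taxon, count in spiked_counts.items()
    }


def sorting_efficiency(found_in_first_sort, found_in_residue):
    """Return the percentage of all organisms that the first sort picked up."""
    total = found_in_first_sort + found_in_residue
    if total == 0:
        raise ValueError("no organisms were found in either sort")
    return 100.0 * found_in_first_sort / total


# Hypothetical spiked sample: known counts added and counts reported by the sorter.
spiked = {"Baetis": 20, "Chironomidae": 40, "Hydropsyche": 10}
reported = {"Baetis": 19, "Chironomidae": 30, "Hydropsyche": 10}

recovery = recovery_by_taxon(spiked, reported)
efficiency = sorting_efficiency(found_in_first_sort=180, found_in_residue=20)

print(recovery)
print(f"Sorting efficiency: {efficiency:.0f} percent")
```

A recovery that is consistently lower for one group than for others, such as the small-bodied taxon in this made-up example, would point to a taxon-specific sorting bias rather than random error.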

Performance Range and Interferences

To evaluate the usefulness of a given method or protocol and to define comparability between or among methods, the method's performance over a range of conditions must be known. Toxicology has used this concept to express certain test-acceptability criteria. Most of these criteria are driven by the biological requirements of the test species used. For example, a toxicity test in which rainbow trout are studied has a prescribed temperature range that considers the natural thermal limits of this species, thus reducing this source of interference. Similar constraints may be imposed for other physical and chemical water-quality characteristics, such as pH, hardness, and osmotic pressure, or grain size in the case of solid phase tests (American Society for Testing and Materials, 1993). There is some debate as to whether performance range and interferences are explicitly acknowledged and measured for many toxicity test methods. For example, a given sediment sample may appear to be toxic owing to an inappropriate grain size for the test species that would be indistinguishable from a true chemical toxicity effect. Similarly, a waste-water effluent may appear to be toxic owing to suboptimal osmotic pressure or nutrient balance that would be indistinguishable from chemical toxicity. Most American Society for Testing and Materials methods discuss potential interferences for each method. For other programs, however, this is an issue that appears to be dealt with in the context of programmatic or regulatory necessities rather than in the context of PBMS.

Biological laboratory procedures also are subject to limits on performance range and to interferences. For example, certain macroinvertebrate sorting procedures were developed, in part, to reduce bias and interferences that result from detrital material or certain sediments present. However, certain taxonomic classifications (that is, species) may be inappropriate or unknown for some groups of organisms owing to limited knowledge and lack of identification procedures. Similarly, young life stages of many species (whether examined in the field or in the laboratory) are difficult to identify, thus posing a potential interference. Although some aspects of performance range and interferences have been identified for certain biological laboratory methods, to a large extent, these need to be documented.

Multimedia Applicability

The media or matrix of the sample can have a profound effect on method accuracy and precision. Similar to analytical chemistry testing, biological toxicity testing handles this issue by providing different procedures for different matrices--aquatic vs. solid phase and freshwater vs. estuarine vs. marine conditions. However, finer aspects of the matrix or media are not necessarily acknowledged in toxicity test methods. For example, the presence of suspended solids could represent a potential interference for some test species and pose a media problem.

In biological laboratory work, certain macroinvertebrate sample-processing procedures may be most accurate and precise if samples are collected from certain types of benthic substrates. For example, sorting efficiency and accuracy can be profoundly affected by the type of substrate collected and the abundance of detrital material. Although this issue is well-known to biologists, many biological laboratory methods have not explicitly quantified matrix applicability for given sample processing procedures.

Biological Field Methods

Field biological collection methods could benefit from a PBMS approach. Indeed, many performance parameters, which are common to any PBMS approach, have been addressed to some extent and are informally recognized during development of specific biological field methods. Better quantification of performance parameters for different methods could provide a useful framework with which to judge method comparability.

To demonstrate the usefulness of PBMS, precision is taken as an example of a performance parameter. Method precision could pertain to many aspects or subprocedures used in biological assessments. For example, interest could be in precision with respect to specific metrics at a given site by using replicate samples taken from the site. Alternatively, concern may be with precision in terms of specific metrics across reference sites in a given ecoregion and within a specific stream reach classification. Finally, interest may be in precision with respect to an assessment score among replicate samples at a site or among reference sites.

The primary difficulty with these precision measures is that they also are dependent on the precision of laboratory methods used. This is a common problem with many prelaboratory methods because prelaboratory performance is based on a laboratory-defined endpoint. In these cases, the only way to compare performance parameters, such as precision or interferences for different prelaboratory methods, is to keep the laboratory methods constant. Unfortunately, this type of comparison has rarely been done for any prelaboratory methods. Examples might include the USGS nutrient preservation study and the U.S. Forest Service (USFS) macroinvertebrate laboratory analysis method.

By establishing relative field method precision among methods, it is possible to derive a precision criterion, to designate a reference method that meets this criterion, and thereby to quantify method comparability. Other performance criteria, such as performance range, potential interferences, and matrix applicability, also would be used to quantify biological field-method comparability. Some of this information is published, but much of this knowledge is incorporated in an informal manner and not quantified within the framework of the method itself. As an example, several published sources discuss advantages and disadvantages of different sampling devices, such as various nets, dredges, and bottle samplers, and the appropriate environmental conditions for which these devices should be used; for example, Burton (1992) for sediment collection, Peckarsky (1984) for macroinvertebrates, and Bryan (1984) for fish. Such information should be quantified for field methods to judge method comparability better. The form would depend on the particular procedural step as shown in table 4. To define a reference method for a given biological field procedure, it is imperative that the specific range of environmental conditions is quantitatively defined. For example, in macroinvertebrate bioassessment methods, performance range has been addressed qualitatively by considering the size of the stream, its specific hydrogeomorphic reach classification, and general habitat features (riffle areas, shallow depth). Such factors as current velocity, stream depth, and substrate size have been quantified or characterized to specify the range of conditions over which a particular method yields a certain level of precision and bias. Different methods then could be classified according to their applicable performance range, and further aspects of method comparability could be determined by examining preestablished performance criteria.


Table 4. Progression of a generic bioassessment field and laboratory method and corresponding steps
requiring performance criteria characterization

Step  Procedure                       Examples of performance criteria

1     Sampling device                 Performance range--Efficiency in different habitat types
                                      Bias--Exclusion of certain taxa
                                      Interferences--Matrix or physical limitations

2     Sampling method                 Performance range--Limitations in certain habitats or matrices
                                      Bias--Sampler (person) efficiency

3     Field sample processing         Precision--Of measures among splits of subsamples
      (subsampling, transfer,         Accuracy--Of transfer process
                                      Performance range--Of preservation and holding time

4     Laboratory sample processing    Precision--Among split samples
      (sieving, sorting)              Accuracy--Of sorting method; equipment used
                                      Performance range--Of sorting method dependent on sample matrix
                                      Bias--Of sorting certain taxonomic groups or organism sizes

5     Taxonomic enumeration           Precision--Split samples
                                      Accuracy--Of identification/counts
                                      Performance range--Dependent on taxonomic group and (or) density
                                      Bias--Counts and identifications

Derivation of Performance Criteria to Evaluate Bioassessment Method Comparability

In performing biological field methods or any prelaboratory method, two fundamental concerns are of interest--that the sample taken and analyzed is representative of the site or the population of interest and that the data obtained are an accurate reflection of the sample collected and analyzed. The first concern is addressed through appropriate field sampling protocols and procedures (including site selection, sampling device, sample preservation) that are dictated, to a certain extent, by the data-quality objectives (DQO's). The second concern is addressed by using appropriate laboratory or analysis protocols and procedures. The second concern is conducive to a PBMS approach because it is somewhat analogous to a laboratory analytical chemistry PBMS--performance parameters, such as accuracy, precision, and bias, can be quantified as discussed earlier.

The concern of sample representativeness for biological field methods is a complex one that involves many components, each with its own set of performance parameters (table 4). For clarity, it may be best to subdivide a field-collection procedure into several compartments; for example, sampling/reference-site selection, sampling device(s), sampling method, field subsampling/processing, and sample preservation/transport/storage (fig. 1). Many variations of each component may be in use. For example, in benthic macroinvertebrate assessments, several different methods or submethods are used, even for the same type of field sites (table 5).

What constitutes a representative sample has been debated for many field situations. Indeed, representativeness itself is dependent, in part, on the DQO's and what, when, and how a measurement is taken. For example, it is well established that many benthic samples may be needed from a stream bottom to obtain reasonable 95-percent confidence intervals for macroinvertebrate density, whereas few benthic samples may be needed to characterize species richness in a given habitat type (U.S. Environmental Protection Agency, 1989); thus, for a given number of samples, there is more assurance that a representative sample has been obtained when the measure of interest is the number of species present than when it is the number of individuals per unit area. For many types of sampling equipment and habitat conditions, power analyses have been performed. This type of information needs to be collated and synthesized with similar information for other aspects of field sampling (tables 4, 5).
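The contrast between density and richness can be made concrete with a rough sample-size calculation. The sketch below assumes the familiar large-sample relation n = (z x CV / D)^2, where CV is the coefficient of variation estimated from pilot samples and D is the desired half-width of the confidence interval expressed as a fraction of the mean. The relation, the value z = 1.96, the 20-percent target, and the pilot data are assumptions made for illustration; a t value would be more appropriate for small pilot sets.

```python
import math
from statistics import mean, stdev


def samples_needed(pilot_values, relative_half_width=0.2, z=1.96):
    """Estimate how many samples give a confidence-interval half-width of
    `relative_half_width` times the mean, using a normal approximation."""
    cv = stdev(pilot_values) / mean(pilot_values)
    return math.ceil((z * cv / relative_half_width) ** 2)


# Hypothetical pilot samples from one habitat type.
pilot_density = [120, 340, 80, 510, 200]   # organisms per square meter
pilot_richness = [18, 20, 17, 21, 19]      # number of taxa per sample

n_density = samples_needed(pilot_density)
n_richness = samples_needed(pilot_richness)

print(f"Samples needed for density:  {n_density}")
print(f"Samples needed for richness: {n_richness}")
```

With these made-up pilot values, the highly variable density measure calls for dozens of samples, whereas the richness measure reaches the same relative precision with very few.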


Figure 1. Flow diagram of a typical bioassessment methodology in the context of performance-based methods system.

One way to judge sample representativeness is to examine the precision of a given measure or metric by analyzing multiple collections from the same location by using the same collection and processing procedures. If the measure of interest displays an unacceptable degree of variability among replicates (as determined by the DQO's), then sampling methods and (or) processing procedures may need to be modified. The USGS National Water Quality Assessment (NAWQA) Program (U.S. Geological Survey, 1993) examined this issue in setting up their stream sampling program.

In the case of biological collection methods, many measures or metrics are potentially available for the same sample. Together, these measures may form an index or score and, eventually, a narrative rating of status (fig. 2). Certain measures, such as density, may exhibit considerable variability among replicate samples, while other measures, such as species and richness measures, may not. This information could be used to determine which measures or metrics should be examined by using a given sampling protocol and DQO's.
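One way to decide which measures a given protocol can support is to screen each metric against a precision objective. The following Python sketch is a hypothetical illustration: the metric names, replicate values, and the 25-percent objective are invented, and the coefficient of variation is only one of several statistics that a program might choose.

```python
from statistics import mean, stdev


def metric_cv(values):
    """Return the among-replicate coefficient of variation, in percent."""
    return 100.0 * stdev(values) / mean(values)


def screen_metrics(replicates_by_metric, max_cv_percent):
    """Return {metric: (cv_percent, meets_objective)} for every metric."""
    screened = {}
    for name, values in replicates_by_metric.items():
        cv = metric_cv(values)
        screened[name] = (cv, cv <= max_cv_percent)
    return screened


# Hypothetical replicate samples collected at one location with one protocol.
replicates_by_metric = {
    "density_per_m2": [120, 340, 80, 510, 200],
    "taxa_richness": [18, 20, 17, 21, 19],
}

screened = screen_metrics(replicates_by_metric, max_cv_percent=25.0)

for name, (cv, ok) in screened.items():
    verdict = "meets" if ok else "does not meet"
    print(f"{name}: CV = {cv:.1f} percent, {verdict} the objective")
```

Metrics that fail the screen could be dropped from the assessment, or the sampling and processing procedures could be modified until the objective is met.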


Figure 2. Data manipulation hierarchy of field-collected biological samples.


For biological collection methods, method comparability could be determined if one knows how a particular metric of interest or assessment score behaves under different environmental conditions (impaired vs. reference sites, different habitat types, different seasons). Such information (obtained through repeated sampling at different times in the same location and sampling in different habitats and locations) would yield estimates of procedural bias, precision, interferences, and performance range (table 6).


Table 5. Benthic macroinvertebrate assessment for wadable streams: sample methodological
variations in the context of the performance-based methods system

Site/habitat sampled        Collection procedure      Field variations

All available habitats      Kick net                  Period of kicking
(riffles, pools, flats,                               Intensity of kicking
and so forth) or                                      Net mesh size
riffles only.                                         Number of kicks per site

                            Colonization baskets      Mesh size
                                                      Colonization time
                                                      Number of baskets per site
                                                      Media in baskets

                            Hester-Dendy              Number of plates per site
                                                      Colonization time

Riffle areas only           Surber                    Period of substrate handling
                                                      Intensity of handling
                                                      Number of samples per site

                            Hess                      Period of substrate handling
                                                      Intensity of handling
                                                      Number of samples per site

Common to all procedures    Sample container          Size
                                                      Transfer of sample to containers

Preanalysis variations (for all field methods):
    Subsampling methods: number of grids; number of organisms; no subsampling.
    Taxonomic level: genus/species; family; varies with group.
    Use of tissue dyes.
    Sieve size/screens.
    Sorting procedures: sucrose gradient; other.


Table 6. Examples of ways in which various performance criteria could be
addressed for biological collection methods

Performance criteria      Example of method requirement

Precision                 Multiple reference sites; multiple samples within a site.

Bias                      Reference "test" sites that provide consistent results.

Performance range         Reference sites in different hydrogeomorphic regions; sampling
                          different habitat types; efficiency of sampling device under
                          different habitat conditions.

Interferences             Knowledge of sampling device performance range; reference
                          condition results; organism instar/size, sexual maturity--
                          sampling index period.

Multimedia applicability  Performance range of sampling device; applicability of metrics
                          to different regions, habitats.

Data Quality

Objectives of the data users will define which measure(s) and what environmental conditions should be used to determine comparability among methods. DQO's also will dictate how similar certain performance parameters need to be to consider two methods, and the data obtained, comparable. It is quite possible that two methods may be very comparable for certain measures of interest and not others. Knowing this, one could use data for those measures where different methods are comparable. This is the advantage of using a PBMS approach. The key is that performance characteristics are defined for each method and that the data user has access to comparability information when reviewing the data.

As mentioned above, many data levels are often available within a typical biological assessment (fig. 2). In addition to comparing certain metrics or indices among methods, it is possible (and sometimes necessary) to compare assessments or ratings. This is especially useful when the field-collection and the laboratory-analysis methods vary between two procedures such that the two methods do not share specific metrics or indices. The most accessible procedure for comparing bioassessment methods is a side-by-side examination of assessment results [D. Lenat, North Carolina Department of Environmental Management, written commun., 1993; Indicators Task Group, written commun. (Draft Issue Paper), 1994]. A discussion of assessment comparability based on stream benthic macroinvertebrate and fish sampling is provided in the ITFM Indicators Task Group (Draft Issue Paper) [written commun., 1994]. Relevant to the present discussion, this paper shows that the paramount performance parameters in assessments are sensitivity or discriminatory power and consistency or reproducibility. Assessments that have greater sensitivity and reproducibility are judged to be more reliable than other assessments. Another result relevant to this discussion is that two assessments may be comparable for some types of sites or levels of impairment and not others.
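A side-by-side examination of assessment results can be summarized in a simple way by tallying how often two methods assign the same rating to the same site. The Python sketch below is a minimal illustration; the site identifiers, the three-level rating scale, and the function names are hypothetical, and a real comparison would likely also consider how far apart the disagreeing ratings are.

```python
def percent_agreement(ratings_a, ratings_b):
    """Return the percentage of sites that receive the same rating from both methods."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("rating lists must be non-empty and of equal length")
    matches = sum(1 for a, b in zip(ratings_a, ratings_b) if a == b)
    return 100.0 * matches / len(ratings_a)


def disagreements(site_ids, ratings_a, ratings_b):
    """Return (site, rating_a, rating_b) for every site rated differently."""
    return [
        (site, a, b)
        for site, a, b in zip(site_ids, ratings_a, ratings_b)
        if a != b
    ]


# Hypothetical ratings of the same five sites by two assessment methods.
sites = ["S1", "S2", "S3", "S4", "S5"]
method_a = ["good", "good", "fair", "poor", "fair"]
method_b = ["good", "fair", "fair", "poor", "poor"]

agreement = percent_agreement(method_a, method_b)
differing = disagreements(sites, method_a, method_b)

print(f"Agreement: {agreement:.0f} percent")
print(differing)
```

Listing the disagreeing sites alongside their site type or suspected stressor is one way to see whether two assessments are comparable for some kinds of sites and not others.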

Defining Performance Criteria for Biological Collection Methods

Biological collection methods (like chemical collection methods) utilize test sites and sites that comprise a known reference condition or reference sites (Technical Appendix F). In many ways, the reference condition is analogous to a chemist's blank; it represents the biological condition when minimum impairment (that is, minimum anthropogenic stressor) is present. Clearly, the chemical blank is a highly controlled entity that is dependent on the matrix, the analyte, and the analytical method being used. Similarly, the biological method blank or reference condition consists of carefully chosen sites that meet certain a priori criteria and is specific for a certain environmental stratum or regime (ecoregion, habitat, season).

An important first step of any biological collection method is to characterize performance parameters by using a given reference condition. This has been done, in part, by several States, some USEPA programs, and the NAWQA Program. In several different ecoregions, reference sites were sampled by using a prescribed method. In some cases, sites were sampled in more than 1 year so that a measure of temporal precision would be obtained for each metric and the assessment score as a whole. Measures for all reference sites within a given region were then compiled to derive the reference-condition characteristics for that region. If this approach is used in different ecoregions, one can obtain quantification of several important performance parameters (table 6). The following specific issues can be addressed for a given field method in this way:

To examine comparability, the methods of interest need to be performed at the same reference sites and preferably at the same time (same seasons and similar conditions). The more reference sites mutually sampled, the better the test of comparability. If one method, for example, yields greater variability (less precision) in the same measure or in assessment scores among reference sites within an ecoregion than another method, then this might be a basis to define a performance criterion for precision. One can then determine method comparability and select an appropriate method, given certain DQO's.
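
The precision comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the metric values, the 15-percent criterion, and the function name are hypothetical, and the coefficient of variation (CV) is one of several possible measures of variability among reference sites.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Return the sample CV, in percent, of a metric across reference sites."""
    m = mean(values)
    if m == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return 100.0 * stdev(values) / m

# Hypothetical taxa-richness values from two methods at the same
# five reference sites in one ecoregion (not data from this report).
method_a = [28, 31, 26, 30, 29]
method_b = [24, 33, 21, 35, 27]

cv_a = coefficient_of_variation(method_a)
cv_b = coefficient_of_variation(method_b)

# Hypothetical performance criterion for precision.
MAX_CV_PERCENT = 15.0
for name, cv in (("A", cv_a), ("B", cv_b)):
    status = "meets" if cv <= MAX_CV_PERCENT else "does not meet"
    print(f"Method {name}: CV = {cv:.1f} percent, {status} the criterion")
```

With these invented values, method B shows the greater variability among reference sites, so it would be the less precise method under this criterion.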

The discussion thus far has been limited to reference sites and conditions. We still do not know how a given method performs over a range of impaired conditions. Unfortunately, we do not have available sites with different known levels of impairment or analogous standards by which to create a calibration curve for a given collection method. However, we can choose sites that have known stressors (urban runoff, metals, grazing, sediments, pesticides) and examine performance parameters for different methods at those sites. Because we cannot guarantee different sites with the same level of impairment within a region, we can examine precision of a method by taking and analyzing multiple samples from the same location.

To compare collection methods, we recommend using the raw metric values, composited multimetric scores, or percentage differences from reference values for each sample. One of the challenges in determining method comparability for bioassessments is that the endpoint or assessment scoring procedure may be intimately related to the type of field procedure used. Differences between methods may be reflected in the taxonomic level used to identify collected organisms and ultimately the actual metrics measured. The result is often a different scoring method to go along with the difference in sampling methods. This type of challenge is less common in analytical chemistry work. Prelaboratory methods (for example, sample collection, preservation) may be independent of the corresponding laboratory methods to a large degree; that is, different prelaboratory methods can then be subjected to the same laboratory analysis to compare prelaboratory methods. The discussion provided in the ITFM Indicators Task Group (Draft Issue Paper) [written commun., 1994] addresses this problem for bioassessments.

Figure 3 and table 7 show how two different methods could be compared by using reference-condition and test-site data. Two different ecoregions or habitat types are assumed in this layout. More habitats or ecoregions would improve determination of the performance range and biases for a given biological collection method. Five reference sites are assumed for each ecoregion; this is a compromise between effort and cost required and resultant statistical power. More reference sites (15 or more) would further refine method precision, performance range, and, possibly, discriminatory power. At least three reference sites in a given region should be considered to be a minimum to evaluate method precision. Given the usually wide variation of natural geomorphic conditions and landscape ecology, even within supposedly "uniform" ecoregions, it is desirable to examine 10 or more reference sites in a region (Technical Appendix F).


Figure 3. How two different field bioassessment methods could be examined to determine method comparability.


Table 7. Recommended process for documentation of performance parameters
and comparability of two different bioassessment methods


A range of impaired sites within a region is suggested to sufficiently characterize a given method. It is important that impaired sites meet the following criteria:

The first criterion is suggested to reduce potential interferences owing to habitat differences between the test site and the reference sites. In this way, the reference site will serve as a true blank as discussed earlier. If one wanted to assess comparability of collection methods to detect physical habitat impairment, then this could be done by examining sites with different habitat deficiencies (for example, siltation, channelization, or lack of riparian vegetation) and no chemical stressors.

The second criterion is necessary to increase the likelihood that the test site is indeed impaired. As discussed previously, it may not be known a priori that a given site is impaired. In this sense, accuracy cannot always be guaranteed for biological field methods. By selecting sites with no stressors (for example, wilderness or protected watersheds), as well as sites with known stressors (as discerned, for example, through laboratory toxicity tests that use those stressors), we can increase our ability to test the accuracy of a given method. Potential test sites might include a body of water that receives naturally high concentrations of chemical stressors, a site downstream of a point-source discharge known to contain toxic concentrations of pollutants, a water body that has been colonized by exotic "pest" species (for example, zebra mussel, grass carp), or a site downstream from a nonpoint source of pollutants (for example, sediment and nutrient enrichment from grazing). The test site must have measured data for the stressor(s) before biological sampling to document the potential cause of impairment.

The third criterion is necessary to have a good test of comparability in terms of method sensitivity and performance range. A severely impaired site (that is, a site with a preponderance of one or two species or a site apparently devoid of aquatic life) is generally recognized as such with little or no formal sampling. This result was observed in comparing bioassessments [ITFM Indicators Task Group, written commun. (Draft Issue Paper), 1994]. Widely different assessment procedures typically yielded the same interpretation at such sites. A much better test of method sensitivity or detection limit, as well as its performance range, is to examine sites with some, but not severe, impairment present. To ensure that a given test site is somewhat, but not severely, impaired, one must rely on information that concerns the stressor(s) (second criterion). Ideally, it would be beneficial to examine several test sites in a given region, each with different stressors present and (or) different levels of the same stressor. Such a sampling design would enable the user to derive more precise estimates of the performance range and any biases of the method or its assessment scoring system.

Recommended Process for Documentation of Performance Parameters

Table 7 summarizes the suggested test design and recommended analyses that compose the process for documenting performance characteristics of a given method and the degree of data comparability between two or more methods. It should be stressed that the process outlined in table 7 is not one that needs to be implemented with every study. Rather, the process should be done programmatically at least once for every method to document the limitations and range of applicability of the methods. Performance characteristics, such as precision, bias, and performance range, are quantified for a given biological collection method by sampling several (at least five) reference sites and several (at least five) test sites (nonreference sites) within each of at least three different ecoregions during the same time or index period (table 7). Thus, for developing performance characteristics for a given method, data from a total of at least 30 sites sampled within a brief time period (preferably within no more than a 2-week period) are needed. Performance characteristics are obtained by analyzing several properties of the data collected for a given method (table 7), including the within-ecoregion variability for a given metric or final score, by using reference-site data for each ecoregion separately, and the among-ecoregion variability for a given metric or score, by using reference-site data from all ecoregions together. In addition, estimates of collection-method sensitivity or discriminatory power are obtained by comparing test-site data with reference-site data within each ecoregion. The performance range of the method can then be defined by comparing the sensitivity of the method over the different ecoregions sampled. Once performance characteristics are defined for a given method, performance criteria can be established, as well as scientifically feasible data-quality objectives.
As a result, a second collection method that demonstrates similar or better performance characteristics is able to meet the established performance criteria. Thus, the data generated by the second method are comparable to those generated by the first method, and data from the two methods can be used together with confidence.
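
A minimal sketch of the two variability summaries named above follows. It assumes that sample variance is the measure of variability and that among-ecoregion variability is computed from the reference-site data of all ecoregions taken together; the ecoregion names, the scores, and the function name are hypothetical.

```python
from statistics import variance

def reference_variability(scores_by_ecoregion):
    """Summarize reference-site score variance for one method.

    scores_by_ecoregion maps an ecoregion name to the scores obtained at
    that ecoregion's reference sites. Returns (within, among), where
    within is a dict of per-ecoregion variances and among is the variance
    of all reference-site scores taken together.
    """
    within = {name: variance(scores)
              for name, scores in scores_by_ecoregion.items()}
    pooled = [s for scores in scores_by_ecoregion.values() for s in scores]
    return within, variance(pooled)

# Hypothetical assessment scores: five reference sites in each of three
# ecoregions, all sampled by the same method in one index period.
reference_scores = {
    "ecoregion_1": [40, 42, 38, 41, 39],
    "ecoregion_2": [30, 34, 28, 32, 31],
    "ecoregion_3": [36, 35, 37, 34, 38],
}

within, among = reference_variability(reference_scores)
for name, var in within.items():
    print(f"{name}: within-ecoregion variance = {var:.2f}")
print(f"All ecoregions together: variance = {among:.2f}")
```

Running the same summary on a second method's data from the same sites gives the pair of variance profiles on which a comparison could rest.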

In determining whether two collection methods give comparable results, note that method comparability is based, for the most part, on the relative magnitude of the reference site variances within and between ecoregions. We explicitly are not basing comparability on actual assessment scores because different methods may have different scoring systems. Likewise, we do not base method comparability on comparison of the actual metric values because some sampling methods may explicitly ignore certain taxonomic groups compared to other methods. However, if the user is especially interested in how different methods compare for a given metric, then this can be easily incorporated into the test design by comparing mean values for regional reference sites by using a paired t-test or nonparametric equivalent.
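
For readers who do want the metric-by-metric comparison mentioned above, the paired t statistic can be computed with the Python standard library alone. The sketch below is illustrative: the values are hypothetical, and the statistic must still be compared with a critical value from a t table for the stated degrees of freedom.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t_statistic(values_a, values_b):
    """Return (t, degrees_of_freedom) for paired observations."""
    if len(values_a) != len(values_b):
        raise ValueError("paired samples must have equal length")
    diffs = [a - b for a, b in zip(values_a, values_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two pairs")
    sd = stdev(diffs)
    if sd == 0:
        raise ValueError("t is undefined when all differences are identical")
    return mean(diffs) / (sd / sqrt(n)), n - 1

# Hypothetical values of one metric at five shared reference sites.
method_a = [25, 31, 28, 30, 27]
method_b = [27, 32, 31, 31, 30]

t, df = paired_t_statistic(method_a, method_b)
print(f"t = {t:.3f} with {df} degrees of freedom")
```

For 4 degrees of freedom, the two-sided critical value at the 0.05 level is 2.776, so these invented data would indicate a systematic difference between the two methods for this metric.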

Although we do not base method comparability on the actual numeric scores because the true score is unknown, one may be able to detect a systematic relation of one method score with another method score by means of regression analyses by using data from this test design. If two methods show significant comparability based on similar performance parameters as discussed earlier, then it is possible to numerically relate scores of one method to the other. This situation would present a clear benefit of pursuing method comparability.
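
One way to look for such a systematic relation is an ordinary least-squares fit of one method's scores on the other's. The sketch below assumes a linear relation and uses hypothetical scores; the function name is illustrative, and a real analysis would also examine residuals and the number of sites.

```python
from statistics import mean

def fit_line(x, y):
    """Least-squares fit of y = intercept + slope * x.

    Returns (slope, intercept, r_squared).
    """
    if len(x) != len(y) or len(x) < 3:
        raise ValueError("need at least three paired scores")
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("scores must vary for a fit to be meaningful")
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept, sxy ** 2 / (sxx * syy)

# Hypothetical total scores from two methods at the same five sites.
scores_method_1 = [20, 24, 28, 32, 36]
scores_method_2 = [31, 35, 41, 44, 49]

slope, intercept, r_squared = fit_line(scores_method_1, scores_method_2)
print(f"method 2 score = {intercept:.2f} + {slope:.3f} x method 1 score")
print(f"r squared = {r_squared:.3f}")
```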

Actual mean scores or metric values are used in this test design only as a ratio between the impaired site and the regional reference value. This ratio is compared among methods to assess sensitivity and accuracy. Because impairment can only be judged relative to a reference or attainable biological condition in the absence of stressors, the score or metric at the impaired test site is not an absolute value and must be related to the appropriate reference-condition value.
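
The ratio described above can be illustrated as follows. The scores are hypothetical, the function name is illustrative, and the sketch assumes a scoring system in which higher scores indicate better biological condition.

```python
from statistics import mean

def impairment_ratio(test_site_score, reference_scores):
    """Ratio of a test-site score to the mean regional reference score."""
    reference_value = mean(reference_scores)
    if reference_value <= 0:
        raise ValueError("reference value must be positive")
    return test_site_score / reference_value

# Hypothetical scores. The two methods use different scoring scales,
# so only the ratios, not the raw scores, are compared.
ratio_a = impairment_ratio(28, [40, 42, 38, 41, 39])
ratio_b = impairment_ratio(17, [18, 20, 19, 21, 22])

print(f"Method A: test site scores {ratio_a:.0%} of reference")
print(f"Method B: test site scores {ratio_b:.0%} of reference")
```

In this invented case, method A places the test site farther below the reference condition than method B does.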

Each method is described in the context of specific performance parameters, which include precision, bias, performance range, and sensitivity. Accuracy also is addressed to the extent that the test sites chosen are likely to be truly impaired on the basis of independent factors (presence of chemical stressors or suboptimal habitat features). Greater score variability among ecoregional reference sites may indicate lower method precision in general, which translates into reduced certainty in the results of a given collection method. For certain DQO's, reduced certainty in the results may be satisfactory if the method has other advantages, such as reduced cost and a short time to perform. The ITFM Indicators Task Group [written commun. (Draft Issue Paper), 1994] provides some basis for making these judgments and trade-offs.

The following example shows how two different methods can be compared with respect to different metrics or community measures for stream benthic macroinvertebrates. Both methods used the same sampling procedure and the same personnel at the same sites at the same times. The difference in the two methods pertained to the subsample sizes used for the laboratory and data analyses. In one method, a 100-organism random subsample was used, and in the other, a 300-organism random subsample was used. Table 8 summarizes the results of the two methods. Differences in metrics or scores between the two methods are expressed as relative percent differences (RPD). It is evident that certain measures or metrics exhibit more variation between the two methods than others; however, all RPD's are less than 25 percent, which suggests good agreement between the two methods. These data suggest that under the sampling conditions and with the personnel performing the study, both subsampling procedures yielded comparable results.
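
The RPD values in table 8 can be checked with a short calculation. The text does not state the formula, so the sketch below assumes the usual definition: the absolute difference between the two values divided by their mean, times 100. Under that assumption the calculation reproduces the tabulated values except for the Hilsenhoff biotic index, for which it gives about 2.2 rather than the 0.2 shown.

```python
def relative_percent_difference(a, b):
    """RPD = |a - b| / mean(a, b) * 100 (assumed definition)."""
    if a == b:
        return 0.0
    average = (a + b) / 2
    if average == 0:
        raise ValueError("RPD is undefined when the mean is zero")
    return abs(a - b) / abs(average) * 100.0

# (metric, subsample A, subsample B) as listed in table 8.
table_8 = [
    ("Number of taxa", 25, 31),
    ("Hilsenhoff biotic index", 4.4, 4.5),
    ("Ratio of scrapers to filter collectors", 36.7, 32.4),
    ("Ephemeroptera, Plecoptera, Trichoptera/Chironomidae", 75.9, 80.8),
    ("Percent contribution of dominant taxon", 27.5, 28.1),
    ("Ephemeroptera, Plecoptera, Trichoptera index", 9, 11),
    ("Shredders/total", 9.3, 7.7),
    ("Hydropsychidae/total Trichoptera", 92.3, 94.1),
    ("Total score", 34, 34),
]

rpds = {name: relative_percent_difference(a, b) for name, a, b in table_8}
for name, rpd in rpds.items():
    print(f"{name}: RPD = {rpd:.1f} percent")
```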


Table 8. Calculation of relative percent difference between two different subsample sizes from the same sample

Metric                                                Subsample A  Subsample B  Relative percent difference
Number of taxa                                        25           31           21.4
Hilsenhoff biotic index                               4.4          4.5          0.2
Ratio of scrapers to filter collectors                36.7         32.4         12.4
Ephemeroptera, Plecoptera, Trichoptera/Chironomidae   75.9         80.8         6.3
Percent contribution of dominant taxon                27.5         28.1         2.2
Ephemeroptera, Plecoptera, Trichoptera index          9            11           20
Shredders/total                                       9.3          7.7          18.8
Hydropsychidae/total Trichoptera                      92.3         94.1         1.9
Total score                                           34           34           0

Probably of greatest interest to those using biological collection methods and their results is the sensitivity or discriminatory power of the method; that is, how well does a given method detect marginally or moderately impaired sites? The suggested test design does not adequately address this question because only a few impaired sites are sampled for each region. However, if the test sites are carefully chosen (by using the second and third criteria discussed above), then one may have some indications of relative method sensitivity. A method that yields a larger ratio of test-site score to reference score would indicate less discriminatory power or sensitivity; that is, the test site is perceived to be similar to or better than the reference condition and, therefore, not impaired. If, however, the intent is to screen many sites to prioritize "hot" spots or significant impairment problems in need of corrective management action, then a method that is inexpensive and quick and tends to show impairment when significant impairment is actually present would be used. In this case, the DQO's dictate a low priority for discriminatory power and a high priority for accuracy in the decision; that is, a purportedly impaired site is truly impaired.

Applicable performance range and bias are two other important performance parameters that relate directly to the overall utility of a given method and its comparability to other methods. These two parameters are characterized by sampling in different ecoregions that, by definition, have different physical habitat characteristics. If one method shows higher precision among reference sites in a given ecoregion or hydrogeomorphic basin/watershed than a similar biological method does, that result may help in deciding where or when a given method should or should not be used. Similarly, a metric or score that exhibits a consistent bias related to certain measured habitat features would help the user decide the types of sampling situations in which a particular method may be appropriate. Clearly, the true performance range of a given method is complicated by the fact that several subprocedures or methods compose a field protocol (fig. 1; tables 1, 4). Each subprocedure has its own performance range. In principle, the performance range of a collection method is best characterized by examining the results over a range of habitat types appropriate to the sampling device being used. Such an examination also would be more likely to reveal method biases that could affect method precision and sensitivity.
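
One simple screen for a habitat-related bias is the correlation between a metric and a measured habitat feature across reference sites, where impairment should not be driving the score. The sketch below is an assumption-laden illustration: the habitat feature (percent fine sediment), the scores, and the function name are hypothetical, and a strong correlation is only a prompt for closer examination, not proof of method bias.

```python
from math import sqrt
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 3:
        raise ValueError("need at least three paired observations")
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant data")
    return sxy / sqrt(sxx * syy)

# Hypothetical reference-site data from several habitat types.
percent_fine_sediment = [5, 10, 20, 35, 50]
method_scores = [40, 39, 35, 30, 26]

r = pearson_r(percent_fine_sediment, method_scores)
print(f"Correlation of score with percent fine sediment: r = {r:.3f}")
```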

References

American Society for Testing and Materials, 1993, Biological effects, 11.04 in Annual book of standards: American Society for Testing and Materials, 1598 p.

Bryan, C., 1984, Warmwater streams techniques manual in fishes: Baton Rouge, La., American Fisheries Society, Southern Division, 117 p.

Burton, A., 1992, Sediment toxicity assessment: Boca Raton, Fla., Lewis Publishers, Inc., p. 37-66.

Peckarsky, B., 1984, Sampling the stream benthos, in Downing, J., and Rigler, F., eds., A manual on methods for the assessment of secondary productivity in freshwater (2d ed.): Oxford, United Kingdom, Blackwell Scientific Publications, IBP Handbook 19, 501 p.

U.S. Environmental Protection Agency, 1989, Short-term methods for estimating the chronic toxicity of effluents and receiving waters to freshwater organisms (2d ed.): Cincinnati, Ohio, U.S. Environmental Protection Agency, Office of Research and Development, EPA-600-4-89-001, 334 p.

----1990, Methods for measuring the acute toxicity of effluents and receiving waters to aquatic organisms (4th ed.): Cincinnati, Ohio, U.S. Environmental Protection Agency, Office of Research and Development, EPA-600-4-90-027, 293 p.

U.S. Geological Survey, 1993, Methods for sampling fish communities as a part of the National Water-Quality Assessment Program: U.S. Geological Survey Report 93-104, 40 p.


Last modified: Fri Nov 8 12:32:14 1996