With the introduction of the concept of performance-based methods systems (PBMS) in laboratory testing, particularly for chemical analytical data, it is apparent that a similar framework may be useful for examining comparability of field and laboratory biological data-collection methods. For example, in evaluating sediment or solid-phase toxicity, the American Society for Testing and Materials (1993) and the U.S. Environmental Protection Agency (1990; written commun., 1994) have developed biological toxicity test methods that have certain known performance criteria. They are currently recommending a PBMS approach to evaluate such toxicity; modifications of the recommended procedures are acceptable if it is shown that the performance criteria, as set by the recommended reference procedure, are met. In this case, method comparability is achieved by meeting specific performance criteria, such as negative control organism survival, growth of control organisms, and test endpoint precision, that have been established for a "reference method" developed under a specific regulatory program (USEPA TSCA, FIFRA, NPDES). Thus, the concept of PBMS is used in some aspects of biological laboratory testing.
Several conceptual similarities exist between chemical and biological laboratory methods with respect to quality-assurance (QA) concepts and method-comparability issues (table 1). In this section, many significant parallels are drawn between analytical and biological laboratory methods within the context of PBMS. Several performance parameters essential to a PBMS framework will be considered below.
Performance criteria | Analytical chemistry methods | Biological methods
Precision | Duplicate and replicate samples. | Multiple taxonomic identifications of one sample; split sample for sorting, identification, enumeration; multiple subsamples.
Bias | Spiked samples; standard reference materials; performance evaluation samples. | Taxonomic reference samples; "spiked" organism samples.
Performance range | Standard reference materials at various concentrations; evaluation of spiked samples by using different matrices. | Efficiency of sorting procedures under different sample conditions.
Interferences | Knowledge of chemical reactions involved in procedure; spiked samples; procedural blanks. | Detrital material, mud in sorting animals; identification of young life stages; taxonomic uncertainty.
Method detection limit | Standards, instrument calibration. | Organism-spiked samples; level of identification.

Table 2. Examples of laboratory Quality-Assurance design requirements
Protocol component | Design requirement
Subsampling | Proper equipment. Training. Standard operating procedures. Proper laboratory facilities. Proper oversight supervision.
Taxonomy | Proper training. Up-to-date literature. Adequate dissecting microscope. Adequate compound microscope. Reference collection. Voucher collection. Predetermined taxon-specific level of identification. Proper oversight supervision (by a skilled scientist).
Precision is an important performance parameter for biological aquatic toxicity testing as well. As in laboratory chemical testing, precision is measured by examining replicate measures of a given biological endpoint (for example, number surviving, growth, number of offspring produced) in tests that use certain reference materials (sodium chloride, copper sulfate, cadmium chloride, sodium pentachlorophenol). In chemical testing, precision is increased by modifying the instrumentation or reagents of the method and through the use of calibration methods. An analogous approach is used to increase the precision of a method in toxicity testing. Such method modifications include the development of a more consistent, reliable food source in chronic toxicity testing (such as in the 7-day Ceriodaphnia survival and reproduction test); development of a standard dilution or control laboratory water (U.S. Environmental Protection Agency, 1990); and improved organism culturing techniques to ensure adequate organism health and consistent genetic composition within a given test (U.S. Environmental Protection Agency, 1989). A method that has lower test precision than a published or programmatic method that uses the same species and endpoint (defined as the reference method by the given program) is generally regarded as less useful, although other criteria may come into play [U.S. Environmental Protection Agency, 1990; J. Diamond, T. Abramson, and D. Reish, Tetra Tech, Inc., written commun. (ASTM E-47.01), 1994].
Laboratory methods for processing of biological field samples and capturing raw data also are concerned with method precision. For example, laboratory operations have distinct components that can have associated quality assurance program activities (table 2). Two component laboratory procedures for benthic macroinvertebrate sampling programs include subsampling and taxonomy. Subsampling is performed with preserved samples in the laboratory in this example. QA-design requirements do not differ between performing subsampling in the field and the laboratory, although adverse weather conditions could interfere with field-subsampling methods. Table 2 presents QA-design requirements for laboratory taxonomy to the genus or species level, although lower level taxonomy (that is, family) can be performed in the field by an experienced taxonomist.
Precision, accuracy, and bias are characterized in biological laboratory analyses of field-collected samples through a variety of mechanisms (table 3). Not unlike chemical laboratory methods, biological methods rely on replicate measures to characterize precision and accuracy. Although method precision is recognized as a basic requirement of biological collection methods, few laboratory methods have actually documented precision or accuracy estimates.
Protocol component | Data quality | Characterization component
Subsampling | Precision | Compare metric values between split samples and (or) replications.
Taxonomy | Precision | Multiple identifications by different taxonomists on single, randomly selected sample.
Taxonomy | Accuracy | Achieved by expert verification or comparison with reference collection.
Subsampling | Accuracy | Recheck of sample residue for missed specimens.
Subsampling | Bias | Randomly selected grid squares; specimens removed to end of grid.
Bias in laboratory processing of field-collected samples has been assessed by using techniques similar to chemical and toxicological testing. This is a performance criterion that has received increasing consideration in biological laboratory QA procedures (U.S. Environmental Protection Agency, written commun., 1994). For example, taxonomic and enumeration bias of plankton or macroinvertebrate samples can be determined by "spiking" blind samples with organisms of known identification and then submitting them to the routine sample-processing procedure. Similarly, performance-evaluation samples could be derived that contain known taxonomic composition and are processed along with actual field samples. Several types of laboratory procedures can be evaluated in this way. Positively identified macroinvertebrates can be added to a synthetic sample that has water, detritus, and no macroinvertebrates to evaluate bias in sorting, as well as taxonomic identification procedures. Alternatively, after sorting macroinvertebrates, the sample residue can be resorted to quantify the number and types of organisms missed or underestimated in typical sorting procedures. Clearly, the above procedures are applicable only for samples that are brought back to the laboratory for processing. Data that are collected in the field only, such as many fish identifications/enumerations, habitat information, or certain physicochemical measurements, require similar performance-parameter characterization but need to be handled differently. Biological field methods of this type are covered later in this technical appendix. Field methods, in general, are treated in Technical Appendix N. Further documentation of bias is needed for many biological methods to evaluate method comparability adequately.
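The spiking check described above can be expressed as a simple recovery calculation. The following is a minimal sketch, not a procedure taken from the cited sources: the function name `sorting_recovery`, the taxa, and the counts are hypothetical, and bias is summarized simply as the percentage of spiked organisms recovered by routine sorting.

```python
def sorting_recovery(spiked, recovered):
    """Return (percent recovery per taxon, overall percent recovery).

    spiked, recovered: dicts mapping taxon name -> organism count.
    All names and counts used here are hypothetical illustrations.
    """
    per_taxon = {}
    for taxon, n_spiked in spiked.items():
        n_found = recovered.get(taxon, 0)
        per_taxon[taxon] = 100.0 * n_found / n_spiked
    total_found = sum(recovered.get(taxon, 0) for taxon in spiked)
    overall = 100.0 * total_found / sum(spiked.values())
    return per_taxon, overall


# Hypothetical blind sample: known organisms added to a synthetic matrix
# of water and detritus, then sorted by the routine procedure.
spiked = {"Baetis": 20, "Hydropsyche": 10, "Chironomus": 20}
recovered = {"Baetis": 19, "Hydropsyche": 10, "Chironomus": 14}

per_taxon, overall = sorting_recovery(spiked, recovered)
for taxon, pct in per_taxon.items():
    print(f"{taxon}: {pct:.0f} percent recovered")
print(f"Overall: {overall:.0f} percent recovered")
```

A recovery that is consistently low for one group (here, the hypothetical small-bodied Chironomus) would point to a taxon-specific sorting bias rather than a general loss of precision.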
Biological laboratory procedures also are subject to limits on performance range and to interferences. For example, certain macroinvertebrate sorting procedures were developed, in part, to reduce bias and interferences that result from detrital material or certain sediments in the sample. However, certain taxonomic levels (for example, species) may be inappropriate or unattainable for some groups of organisms owing to limited knowledge and lack of identification procedures. Similarly, young life stages of many species (whether examined in the field or in the laboratory) are difficult to identify, thus posing a potential interference. Although some aspects of performance range and interferences have been identified for certain biological laboratory methods, to a large extent, these still need to be documented.
In biological laboratory work, certain macroinvertebrate sample-processing procedures may be most accurate and precise if samples are collected from certain types of benthic substrates. For example, sorting efficiency and accuracy can be profoundly affected by the type of substrate collected and the abundance of detrital material. Although this issue is well-known to biologists, many biological laboratory methods have not explicitly quantified matrix applicability for given sample processing procedures.
Field biological collection methods could benefit from a PBMS approach. Indeed, many performance parameters, which are common to any PBMS approach, have been addressed to some extent and are informally recognized during development of specific biological field methods. Better quantification of performance parameters for different methods could provide a useful framework with which to judge method comparability.
To demonstrate the usefulness of PBMS, precision is taken as an example of a performance parameter. Method precision could pertain to many aspects or subprocedures used in biological assessments. For example, interest could be in precision with respect to specific metrics at a given site by using replicate samples taken from the site. Alternatively, concern may be with precision in terms of specific metrics across reference sites in a given ecoregion and within a specific stream reach classification. Finally, interest may be in precision with respect to an assessment score among replicate samples at a site or among reference sites.
By establishing relative field method precision among methods, it is possible to derive a precision criterion, to designate a reference method that meets this criterion, and thereby to quantify method comparability. Other performance criteria, such as performance range, potential interferences, and matrix applicability, also would be used to quantify biological field-method comparability. Some of this information is published, but much of this knowledge is incorporated in an informal manner and not quantified within the framework of the method itself. As an example, several published sources discuss advantages and disadvantages of different sampling devices, such as various nets, dredges, and bottle samplers, and the environmental conditions for which these devices should be used; for example, Burton (1992) for sediment collection, Peckarsky (1984) for macroinvertebrates, and Bryan (1984) for fish. Such information should be quantified for field methods to judge method comparability better. The form would depend on the particular procedural step, as shown in table 4. To define a reference method for a given biological field procedure, it is imperative that the specific range of environmental conditions is quantitatively defined. For example, in macroinvertebrate bioassessment methods, performance range has been addressed qualitatively by considering the size of the stream, its specific hydrogeomorphic reach classification, and general habitat features (riffle areas, shallow depth). Such factors as current velocity, stream depth, and substrate size have been quantified or characterized to specify the range of conditions over which a particular method yields a certain level of precision and bias. Different methods then could be classified according to their applicable performance range, and further aspects of method comparability could be determined by examining preestablished performance criteria.
Step | Procedure | Examples of performance criteria
1 | Sampling device | Performance range--Efficiency in different habitat types. Bias--Exclusion of certain taxa. Interferences--Matrix or physical limitations.
2 | Sampling method | Performance range--Limitations in certain habitats or matrices. Bias--Sampler (person) efficiency.
3 | Field sample processing (subsampling, transfer, ...) | Precision--Of measures among splits of subsamples. Accuracy--Of transfer process. Performance range--Of preservation and holding time.
4 | Laboratory sample processing (sieving, sorting) | Precision--Among split samples. Accuracy--Of sorting method; equipment used. Performance range--Of sorting method dependent on sample matrix. Bias--Of sorting certain taxonomic groups or organism sizes.
5 | Taxonomic enumeration | Precision--Split samples. Accuracy--Of identification/counts. Performance range--Dependent on taxonomic group and (or) density. Bias--Counts and identifications.
Sample representativeness for biological field methods is a complex concern that involves many components, each with its own set of performance parameters (table 4). For clarity, it may be best to subdivide a field-collection procedure into several compartments; for example, sampling/reference-site selection, sampling device(s), sampling method, field subsampling/processing, and sample preservation/transport/storage (fig. 1). Many variations of each component may be in use. For example, in benthic macroinvertebrate assessments, several different methods or submethods are used, even for the same type of field sites (table 5).
What constitutes a representative sample has been debated for many field situations. Indeed, representativeness itself is dependent, in part, on the data-quality objectives (DQO's) and on what, when, and how a measurement is taken. For example, it is well established that many benthic samples may be needed from a stream bottom to obtain reasonable 95-percent confidence intervals for macroinvertebrate density, whereas few benthic samples may be needed to characterize species richness in a given habitat type (U.S. Environmental Protection Agency, 1989); thus, for a given number of samples, there is more assurance that a representative sample has been obtained when the measure of interest is the number of species present than when it is the number of individuals per unit area. For many types of sampling equipment and habitat conditions, power analyses have been performed. This type of information needs to be collated and synthesized with similar information for other aspects of field sampling (tables 4, 5).
One way to judge sample representativeness is to examine the precision of a given measure or metric by analyzing multiple collections from the same location by using the same collection and processing procedures. If the measure of interest displays an unacceptable degree of variability among replicates (as determined by the DQO's), then sampling methods and (or) processing procedures may need to be modified. The USGS National Water-Quality Assessment (NAWQA) Program (U.S. Geological Survey, 1993) examined this issue in setting up its stream sampling program.
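The replicate check described above can be sketched as a coefficient-of-variation calculation. This is an illustration only: the choice of the coefficient of variation as the precision statistic, the 20-percent threshold, the function names, and the replicate values are all assumptions introduced here, not values from the cited programs.

```python
import statistics


def coefficient_of_variation(values):
    """Percent coefficient of variation: sample standard deviation / mean."""
    return 100.0 * statistics.stdev(values) / statistics.mean(values)


def meets_precision_dqo(values, max_cv_percent):
    """True if replicate variability is within the (hypothetical) DQO limit."""
    return coefficient_of_variation(values) <= max_cv_percent


# Hypothetical results from four replicate collections at one location,
# all taken and processed with the same procedures.
density = [410, 655, 290, 820]      # organisms per square meter
taxa_richness = [24, 26, 25, 27]    # number of taxa
MAX_CV = 20.0                       # hypothetical DQO threshold, in percent

for name, values in (("density", density), ("taxa richness", taxa_richness)):
    cv = coefficient_of_variation(values)
    verdict = "acceptable" if meets_precision_dqo(values, MAX_CV) else "revise method"
    print(f"{name}: CV = {cv:.1f} percent ({verdict})")
```

With these made-up numbers, density fails the threshold and taxa richness passes, which is the kind of result that would prompt either more replicates or a change in the measure being reported.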
In the case of biological collection methods, many measures or metrics are potentially available for the same sample. Together, these measures may form an index or score and, eventually, a narrative rating of status (fig. 2). Certain measures, such as density, may exhibit considerable variability among replicate samples, whereas other measures, such as species richness measures, may not. This information could be used to determine which measures or metrics should be examined by using a given sampling protocol and DQO's.
Figure 2. Data manipulation hierarchy of field-collected biological samples.
For biological collection methods, method comparability could be determined if one knows how a particular metric of interest or assessment score behaves under different environmental conditions (impaired vs. reference sites, different habitat types, different seasons). Such information (obtained through repeated sampling at different times in the same location and sampling in different habitats and locations) would yield estimates of procedural bias, precision, interferences, and performance range (table 6).
Site/habitat sampled | Collection procedure | Field variations
All available habitats (riffles, pools, flats, and so forth) or riffles only | Kick net | Period of kicking. Intensity of kicking. Net mesh size. Number of kicks per site.
All available habitats (riffles, pools, flats, and so forth) or riffles only | Colonization baskets | Mesh size. Colonization time. Number of baskets per site. Media in baskets.
All available habitats (riffles, pools, flats, and so forth) or riffles only | Hester-Dendy | Number of plates per site. Colonization time.
Riffle areas only | Surber | Period of substrate handling. Intensity of handling. Number of samples per site.
Riffle areas only | Hess | Period of substrate handling. Intensity of handling. Number of samples per site.
Common to all procedures | Sample container | Size. Transfer of sample to containers.

Preanalysis variations (for all field methods):
Subsampling methods: Number of grids. Number of organisms. No subsampling.
Taxonomic level: Genus/species. Family. Varies with group.
Use of tissue dyes.
Sieve size/screens.
Sorting procedures: Sucrose gradient. Other.

Table 6. Examples of ways in which various performance criteria could be
Performance criteria | Example of method requirement
Precision | Multiple reference sites; multiple samples within a site.
Bias | Reference "test" sites that provide consistent results.
Performance range | Reference sites in different hydrogeomorphic regions; sampling different habitat types; efficiency of sampling device under different habitat conditions.
Interferences | Knowledge of sampling device performance range; reference condition results; organism instar/size, sexual maturity--sampling index period.
Multimedia applicability | Performance range of sampling device; applicability of metrics to different regions, habitats.
As mentioned above, many data levels are often available within a typical biological assessment (fig. 2). In addition to comparing certain metrics or indices among methods, it is possible (and sometimes necessary) to compare assessments or ratings. This is especially useful when the field-collection and the laboratory-analysis methods differ between two procedures to such an extent that the two share no specific metrics or indices. The most accessible procedure for comparing bioassessment methods is a side-by-side examination of assessment results [D. Lunate, North Carolina Department of Environmental Management, written commun., 1993; Indicators Task Group, written commun. (Draft Issue Paper), 1994]. A discussion of assessment comparability based on stream benthic macroinvertebrate and fish sampling is provided in the ITFM Indicators Task Group (Draft Issue Paper) [written commun., 1994]. Relevant to the present discussion, this paper shows that the paramount performance parameters in assessments are sensitivity, or discriminatory power, and consistency, or reproducibility. Assessments that have greater sensitivity and reproducibility are judged to be more reliable than other assessments. Another result relevant to this discussion is that two assessments may be comparable for some types of sites or levels of impairment and not for others.
An important first step of any biological collection method is to characterize performance parameters by using a given reference condition. This has been done, in part, by several States, some USEPA programs, and the NAWQA Program. In several different ecoregions, reference sites were sampled by using a prescribed method. In some cases, sites were sampled in more than 1 year so that a measure of temporal precision would be obtained for each metric and the assessment score as a whole. Measures for all reference sites within a given region were then compiled to derive the reference-condition characteristics for that region. If this approach is used in different ecoregions, one can obtain quantification of several important performance parameters (table 6). The following specific issues can be addressed for a given field method in this way:
The discussion thus far has been limited to reference sites and conditions. We still do not know how a given method performs over a range of impaired conditions. Unfortunately, we do not have available sites with different known levels of impairment or analogous standards by which to create a calibration curve for a given collection method. However, we can choose sites that have known stressors (urban runoff, metals, grazing, sediments, pesticides) and examine performance parameters for different methods at those sites. Because we cannot guarantee different sites with the same level of impairment within a region, we can examine precision of a method by taking and analyzing multiple samples from the same location.
To compare collection methods, we recommend using the raw metric values, composited multimetric scores, or percentage differences from reference values for each sample. One of the challenges in determining method comparability for bioassessments is that the endpoint or assessment scoring procedure may be intimately related to the type of field procedure used. Differences between methods may be reflected in the taxonomic level used to identify collected organisms and ultimately the actual metrics measured. The result is often a different scoring method to go along with the difference in sampling methods. This type of challenge is less common in analytical chemistry work. Prelaboratory methods (for example, sample collection, preservation) may be independent of the corresponding laboratory methods to a large degree; that is, different prelaboratory methods can then be subjected to the same laboratory analysis to compare prelaboratory methods. The discussion provided in the ITFM Indicators Task Group (Draft Issue Paper) [written commun., 1994] addresses this problem for bioassessments.
Figure 3 and table 7 show how two different methods could be compared by using reference-condition and test-site data. Two different ecoregions or habitat types are assumed in this layout. More habitats or ecoregions would improve determination of the performance range and biases for a given biological collection method. Five reference sites are assumed for each ecoregion; this is a compromise between effort and cost required and resultant statistical power. More reference sites (15 or more) would further refine method precision, performance range, and, possibly, discriminatory power. At least three reference sites in a given region should be considered to be a minimum to evaluate method precision. Given the usually wide variation of natural geomorphic conditions and landscape ecology, even within supposedly "uniform" ecoregions, it is desirable to examine 10 or more reference sites in a region (Technical Appendix F).
A range of impaired sites within a region is suggested to sufficiently characterize a given method. It is important that impaired sites meet the following criteria:
The second criterion is necessary to make it likely that the test site is indeed impaired. As discussed previously, it may not be known a priori that a given site is impaired. In this sense, accuracy cannot always be guaranteed for biological field methods. By selecting sites with no stressors (for example, wilderness or protected watersheds), as well as sites with known stressors (as discerned through laboratory toxicity tests, for example, using those stressors), we can increase our ability to test the accuracy of a given method. Potential test sites might be a body of water that receives naturally high concentrations of chemical stressors, a site downstream of a point-source discharge known to contain toxic concentrations of pollutants, a water body that has been colonized by exotic "pest" species (for example, zebra mussel, grass carp), or a site downstream from a nonpoint source of pollutants (for example, sediment and nutrient enrichment from grazing). The test site must have measured data for the stressor(s) before biological sampling to document potential cause for impairment.
The third criterion is necessary to have a good test of comparability in terms of method sensitivity and performance range. A severely impaired site (that is, a site with a preponderance of one or two species or a site apparently devoid of aquatic life) is generally recognized as such with little or no formal sampling. This result was observed in comparing bioassessments [ITFM Indicators Task Group, written commun. (Draft Issue Paper), 1994]. Widely different assessment procedures typically yielded the same interpretation at such sites. A much better test of method sensitivity or detection limit, as well as its performance range, is to examine sites with some, but not severe, impairment present. To ensure that a given test site is somewhat, but not severely, impaired, one must rely on information that concerns the stressor(s) (second criterion). Ideally, it would be beneficial to examine several test sites in a given region, each with different stressors present and (or) different levels of the same stressor. Such a sampling design would enable the user to derive more precise estimates of the performance range and any biases of the method or its assessment scoring system.
In determining whether two collection methods give comparable results, note that method comparability is based, for the most part, on the relative magnitude of the reference site variances within and between ecoregions. We explicitly are not basing comparability on actual assessment scores because different methods may have different scoring systems. Likewise, we do not base method comparability on comparison of the actual metric values because some sampling methods may explicitly ignore certain taxonomic groups compared to other methods. However, if the user is especially interested in how different methods compare for a given metric, then this can be easily incorporated into the test design by comparing mean values for regional reference sites by using a paired t-test or nonparametric equivalent.
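The paired comparison of a single metric mentioned above can be sketched as follows. This is an illustration under stated assumptions: the function name `paired_t`, the five reference sites, and the metric values are hypothetical, and the critical value of 2.776 is the standard two-sided 0.05 value for 4 degrees of freedom taken from t tables rather than computed.

```python
import math
import statistics


def paired_t(values_a, values_b):
    """Paired t statistic and degrees of freedom.

    values_a, values_b: one metric measured by two methods at the same
    reference sites, in the same site order. Assumes the site-by-site
    differences are not all identical (otherwise the standard deviation
    of the differences is zero and t is undefined).
    """
    if len(values_a) != len(values_b):
        raise ValueError("both methods must be applied at the same sites")
    diffs = [a - b for a, b in zip(values_a, values_b)]
    n = len(diffs)
    t = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
    return t, n - 1


# Hypothetical taxa richness at five regional reference sites.
method_a = [25, 28, 22, 30, 27]
method_b = [27, 29, 25, 31, 30]

t_stat, df = paired_t(method_a, method_b)
T_CRITICAL = 2.776  # two-sided, alpha = 0.05, 4 degrees of freedom
print(f"t = {t_stat:.2f} with {df} degrees of freedom")
print("methods differ for this metric" if abs(t_stat) > T_CRITICAL
      else "no significant difference detected")
```

With so few sites the test has little power, so a nonsignificant result should not be read as proof that the two methods agree; a nonparametric equivalent would follow the same paired layout.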
Although we do not base method comparability on the actual numeric scores because the true score is unknown, one may be able to detect a systematic relation of one method score with another method score by means of regression analyses by using data from this test design. If two methods show significant comparability based on similar performance parameters as discussed earlier, then it is possible to numerically relate scores of one method to the other. This situation would present a clear benefit of pursuing method comparability.
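The regression idea above can be illustrated with an ordinary least-squares fit of one method's scores on the other's. This is a sketch only: the function name `fit_line` and the paired scores are hypothetical, and a straight-line relation is assumed for simplicity.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y on x.

    Returns (slope, intercept, r_squared). Assumes x and y have the same
    length and that neither is constant.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared


# Hypothetical assessment scores from two methods at the same five sites.
scores_method_a = [30, 34, 38, 42, 46]
scores_method_b = [16, 18, 19, 22, 23]

slope, intercept, r_squared = fit_line(scores_method_a, scores_method_b)
print(f"score_B = {slope:.2f} * score_A + {intercept:.2f}  (r^2 = {r_squared:.3f})")
```

A relation of this kind is worth using to translate scores only after the two methods have been shown to be comparable in their performance parameters; a high r-squared by itself does not establish that.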
Actual mean scores or metric values are used in this test design only as a ratio between the impaired site and the regional reference value. This ratio is compared among methods to assess sensitivity and accuracy. Because impairment can only be judged relative to a reference or attainable biological condition in the absence of stressors, the score or metric at the impaired test site is not an absolute value and must be related to the appropriate reference-condition value.
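The ratio described above can be sketched in a few lines. The function name `impairment_ratio`, the scores, and the use of the mean of the reference-site scores as the regional reference value are assumptions made for illustration; a program might instead use a median or a percentile.

```python
import statistics


def impairment_ratio(test_site_score, reference_scores):
    """Test-site score divided by the regional reference value.

    The regional reference value is taken here as the mean of the
    reference-site scores (an assumption made for this sketch).
    """
    return test_site_score / statistics.mean(reference_scores)


# Hypothetical scores. The two methods use different scoring systems,
# so only the ratios, not the raw scores, are compared between them.
ratio_a = impairment_ratio(24, [40, 42, 38, 44, 36])   # method A
ratio_b = impairment_ratio(17, [20, 22, 18, 21, 19])   # method B

print(f"Method A: test site scores {ratio_a:.2f} of reference")
print(f"Method B: test site scores {ratio_b:.2f} of reference")
```

In this made-up case, method A places the test site farther below the reference condition than method B does, even though the raw scores of the two methods cannot be compared directly.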
Each method is described in the context of specific performance parameters, which include precision, bias, performance range, and sensitivity. Accuracy also is addressed to the extent that the test sites chosen are likely to be truly impaired on the basis of independent factors (presence of chemical stressors or suboptimal habitat features). A method that exhibits greater score variability among ecoregional reference sites may have less method precision in general. This would translate into reduced certainty in the results of a given collection method. For certain DQO's, reduced certainty in the results may be satisfactory if the method has other advantages, such as reduced costs and short time to perform. The ITFM Indicators Task Group [written commun. (Draft Issue Paper), 1994] gives some basis for making these judgments and trade-offs.
The following example shows how two different methods can be compared with respect to different metrics or community measures for stream benthic macroinvertebrates. Both methods used the same sampling procedure and the same personnel at the same sites at the same times. The difference in the two methods pertained to the subsample sizes used for the laboratory and data analyses. In one method, a 100-organism random subsample was used, and in the other, a 300-organism random subsample was used. Table 8 summarizes the results of the two methods. Differences in metrics or scores between the two methods are expressed as relative percent differences (RPD). It is evident that certain measures or metrics exhibit more variation between the two methods than others; however, all RPD's are less than 25 percent, which suggests good agreement between the two methods. These data suggest that under the sampling conditions and with the personnel performing the study, both subsampling procedures yielded comparable results.
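The relative percent difference used in this comparison can be computed as sketched below. The source does not state the formula; the sketch assumes the conventional definition (absolute difference divided by the mean of the two values, times 100), which reproduces the values reported in table 8 for number of taxa (21.4) and for the Ephemeroptera, Plecoptera, Trichoptera index (20). The function name is hypothetical.

```python
def relative_percent_difference(a, b):
    """Relative percent difference between two nonnegative results.

    Assumed definition: 100 * |a - b| / mean(a, b). Returns 0.0 when both
    values are zero, because the two results then agree exactly.
    """
    mean = (a + b) / 2.0
    if mean == 0:
        return 0.0
    return 100.0 * abs(a - b) / mean


# Values from table 8: number of taxa, EPT index, and total score.
rpd_taxa = relative_percent_difference(25, 31)
rpd_ept_index = relative_percent_difference(9, 11)
rpd_total = relative_percent_difference(34, 34)

print(f"Number of taxa: {rpd_taxa:.1f} percent")
print(f"EPT index: {rpd_ept_index:.1f} percent")
print(f"Total score: {rpd_total:.1f} percent")
```

Because the difference is divided by the mean of the two results rather than by either one alone, the statistic does not require treating one subsample size as the "true" value.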
Probably of greatest interest to those using biological collection methods and their results is the sensitivity or discriminatory power of the method; that is, how well does a given method detect marginally or moderately impaired sites? The suggested test design does not adequately address this question because only a few impaired sites are sampled for each region. However, if the test sites are carefully chosen (by using the second and third criteria discussed above), then one may have some indications of relative method sensitivity. A method that yields a larger ratio of test-site score to reference score would indicate less discriminatory power or sensitivity; that is, the test site is perceived to be similar to or better than the reference condition and, therefore, not impaired. If, however, the intent is to screen many sites to prioritize "hot" spots or significant impairment problems in need of corrective management action, then a method that is inexpensive and quick and tends to show impairment when significant impairment is actually present would be used. In this case, the DQO's dictate a low priority for discriminatory power and a high priority for accuracy in the decision; that is, a purportedly impaired site is truly impaired.
Metric | Subsample A | Subsample B | Relative percent difference
Number of taxa | 25 | 31 | 21.4
Hilsenhoff biotic index | 4.4 | 4.5 | 0.2
Ratio of scrapers to filter collectors | 36.7 | 32.4 | 12.4
Ephemeroptera, Plecoptera, Trichoptera/Chironomidae | 75.9 | 80.8 | 6.3
Percent of contribution of dominant taxon | 27.5 | 28.1 | 2.2
Ephemeroptera, Plecoptera, Trichoptera index | 9 | 11 | 20
Shredders/total | 9.3 | 7.7 | 18.8
Hydropsychidae/total Trichoptera | 92.3 | 94.1 | 1.9
Total score | 34 | 34 | 0
Applicable performance range and bias are two other important performance parameters that relate directly to the overall utility of a given method and its comparability to other methods. These two parameters are characterized by sampling in different ecoregions that, by definition, have different physical habitat characteristics. A finding that one method shows higher precision among reference sites in a given ecoregion or hydrogeomorphic basin/watershed than another similar biological method may be useful for deciding where or when a given method should or should not be used. Similarly, a metric or score that exhibits a consistent bias related to certain measured habitat features would help the user decide the types of sampling situations in which a particular method may be appropriate. Clearly, the true performance range of a given method is complicated by the fact that several subprocedures or methods compose a field protocol (fig. 1; tables 1, 4). Each subprocedure has its own performance range. In principle, the performance range of a collection method is best characterized by examining the results over a range of habitat types appropriate to the sampling device being used. Such an examination also would be more likely to reveal method biases that could affect method precision and sensitivity.
Bryan, C., 1984, Warmwater streams techniques manual in fishes: Baton Rouge, La., American Fisheries Society, Southern Division, 117 p.
Burton, A., 1992, Sediment toxicity assessment: Boca Raton, Fla., Lewis Publishers, Inc., p. 37-66.
Peckarsky, B., 1984, Sampling the stream benthos, in Downing, J., and Rigler, F., eds., A manual on methods for the assessment of secondary productivity in freshwater (2d ed.): Oxford, United Kingdom, Blackwell Scientific Publications, IBP Handbook 19, 501 p.
U.S. Environmental Protection Agency, 1989, Short-term methods for estimating the chronic toxicity of effluents and receiving waters to freshwater organisms (2d ed.): Cincinnati, Ohio, U.S. Environmental Protection Agency, Office of Research and Development, EPA-600-4-89-001, 334 p.
----1990, Methods for measuring the acute toxicity of effluents and receiving waters to aquatic organisms (4th ed.): Cincinnati, Ohio, U.S. Environmental Protection Agency, Office of Research and Development, EPA-600-4-90-027, 293 p.
U.S. Geological Survey, 1993, Methods for sampling fish communities as a part of the National Water-Quality Assessment Program: U.S. Geological Survey Report 93-104, 40 p.