An Alternative Regression Method for Constituent Loads from Streams

Ping Wang, Water Resources Engineer

Maryland DNR

Lewis C. Linker, Modeling Coordinator

USEPA CBPO

 

Abstract

Three regression models for fluvial loads from the Susquehanna and James Rivers are compared: the 7-parameter Minimum Variance Unbiased Estimator (MVUE), a multi-variance classical Ordinary Regression Method (ORM), and a newly developed Unbiased Regression Method (URM). The ORMs and URMs are both based on the 7-parameter log-linear regression equation, with some modifications to the flow and/or time parameters, but they differ in bias correction. The ORMs apply an unbiased least-squares regression methodology, yielding unbiased log[load] estimates, but the load estimates are biased because the log-to-normal transformation is made without bias correction. The ORMs generally have a lower standard error of estimate (SE) in daily loads than MVUE. The MVUE, having low total bias, is better than ORM at estimating average loads over large time steps. The URMs apply the newly developed bias corrections, yielding a virtually unbiased load estimate (total bias < 0.005% in the regression period) and generally a lower SE than MVUE and ORM. When the regression is extrapolated to extended periods, in which the observations are not used to derive the regression, the ORMs generally have a lower SE than MVUE and URM. No single method is better than the others in all respects; nevertheless, URM and ORM can be alternative approaches, or supplemental tools, to the MVUE or other regression methods.

Introduction

Regression models have been applied extensively to estimate fluvial transport (Miller 1951, Bradu and Mundlak 1970, Cohn et al 1989, 1992, Preston et al 1989, Belval et al 1995, Wang et al 1998). A concentration - flow relationship based log-linear regression model, the Minimum Variance Unbiased Estimator (Cohn et al 1989, 1992), has been widely used to estimate sediment and nutrient loads to the Chesapeake Bay, USA (Cohn et al 1989, 1992, Belval et al 1995, Wang et al 1998).

Various regression models for stream loads have been established (Miller 1951, Bradu and Mundlak 1970, Cohn et al 1992, Preston et al 1989, Walling 1977). Most of them are log-linear regressions differing primarily in their regressor terms. The log-linear regressions use log[load] or log[conc] as the estimator. The log[load] or log[conc] values derived from the regression are then transformed from the log space to the normal space, usually by taking the exponential of the log values directly. There are two types of bias associated with the log-linear regressions. Type-1 bias is due to the method used to derive the regression equation. Type-2 bias is due to the transformation of the regression results from log space to normal space. Type-1 bias can usually be avoided by choosing an unbiased regression derivation, such as the least-squares method used in this paper (unbiased under the classical statistical assumptions, by the Gauss-Markov theorem (Grewal 1990)). Therefore, this paper will only discuss Type-2 bias, how it is generated, and methods for its correction. In this discussion the classical regressions without bias correction will be classified as the ordinary regression.

Bradu and Mundlak (1970) proposed a Minimum Variance Unbiased Estimator (MVUE) which involved a bias correction. Cohn et al. (1992) developed their own MVUE (simply called the MVUE in this paper, because no other MVUE regressions are discussed), which had two improvements over the earlier log-linear regressions for stream loads. One was the establishment of additional regressor terms, known as the 7-parameter multi-variance log-linear regression. These additional terms accounted for seasonality and long-term load trends. The other improvement was the introduction of a bias correction. Cohn et al (1992) compared the MVUE results with the rating curve method (RC) and the quasi maximum likelihood estimator (QMLE). The MVUE had lower total bias (about 1-6% for the Susquehanna, compared with 0% to -24% for RC or -6% for QMLE).

Theoretically MVUE is an unbiased estimator; however, it is sensitive to the assumption of normality in the log space (Thomas 1988, Duan 1983, Cohn et al 1992). It may be a biased estimator if the distribution is not lognormal (Gilbert 1987). Although MVUE was successful in reducing biases compared to previous studies, there is still a certain degree of bias in the computational outputs, which may be as low as 1-6% for the Susquehanna, or higher in other rivers such as the Choptank, 4-12%, and the Potomac, 8-26%. The standard deviations were virtually the same among the three methods compared (Cohn et al 1992, Table 5). Therefore, in this study a new approach to bias correction was developed, which may produce a minimal calculated bias and a low standard error of estimate.

The ordinary regression without bias correction using the 7-parameter regressors of Cohn et al. (1992) has shown better results than the RC and QMLE (which had fewer regressors). Therefore, this study was primarily based on the 7-parameter regression equation, which is simply called ORM if no bias correction is involved (to distinguish it from MVUE, RC, or QMLE). The bias correction method newly developed in this project was applied to the ORM outputs; the result is called URM (Unbiased Regression Method). In addition to Cohn’s 7-parameter equation, two other equations modified from Cohn’s were used. Both were applied with ORM and URM. In order to distinguish these methods, the three regressions without bias correction are called ORM1, ORM2, and ORM3, and the corresponding regressions with bias correction are called URM1, URM2, and URM3. Detailed equations can be found in the text.

This paper will present our approaches, discuss the correlation of load with flow and time, and then present the regression equations setup, regression results, and the comparison among MVUE, ORMs, and URMs.

Methodology

In accordance with most published regression methods for fluvial loads, the regression methods for load estimation use known daily mean flows and known discrete concentrations (or loads) to predict daily constituent loads. The observed concentrations are assumed to represent the average values on the reported dates. All of the regressions are based on the classical statistical assumptions.

The regressors were based on the multi-variance regression equation of Cohn et al. (1992), and its modifications. The estimator-regressors correlation was analyzed with observed data using simple graphics. For ORM the estimator term was log[load] (log[conc] for MVUE) in a selected period (e.g., 1986-1990); some regressor terms were modified from MVUE; new unbiased correction methods (URM), were applied over ORM. We applied the same data with MVUE, ORM and URM. The outputs were the predicted daily loads in an expanded period (e.g., 1984-1992, with the extended periods of 1984-1985 and 1991-1992) according to known daily flows. In order to evaluate these methods the daily loads from ORM, URM and MVUE were compared with observed data.

In the derivation of regression equations for ORM and URM, the least-squares method was applied to produce estimates that were the Best Linear Unbiased Estimates (BLUE) under the classical statistical assumptions. For ORM, Type-2 bias results from the transformation from log space to normal space. The software Estimator_94 (kindly provided by the US Geological Survey) was used for the MVUE regression.

Note: Here, an approach different from that of Cohn et al. (1992) was used to evaluate the regression methods. Cohn et al. applied the split-sampling approach of Thomas (1988), selecting subpopulations of 75 samples from several hundred observations in 9 years (1980-1988). Split sampling is a good evaluation method for a large number of samples in which each randomly selected subpopulation is normally distributed. However, some important data that occur less frequently, such as storm-flow samples, might not be selected in some subsets. In fact, the standard deviations (SD) in Cohn et al. (1992) were the same for the different methods for a specific constituent in a specific river. Because the regression methods may require sufficiently representative samples, this study avoided evaluating them with a small number of sub-samples drawn from rather irregularly varying samples. Instead, all available samples from those years with sufficient observations for regression were used.

This evaluation of regression results will consider the standard error of estimate (SE) based on daily data, as well as the goodness of yearly load estimates, in both the regression period (i.e., the years in which observed data were used for regression) and the extended periods (i.e., the years in which observed data were not used for regression, but were used to check regression results). Further analysis with the split-sampling approach of Thomas (1988) may be useful to evaluate how well a regression method could be applied to various sampling conditions (including sample sets deviating from the classical statistical assumptions); however, this is beyond the scope of this study.

Selection of Observed Data

Water quality data of the constituents, dissolved nitrate + nitrite (NO2-3), dissolved phosphate (PO4), total phosphorus (TP), total nitrogen (TN), total organic nitrogen (OrN), total Kjeldahl nitrogen (TKN), and sediments (Sed), from the Susquehanna River (Station 1578310), Maryland, and the James River (Station 2035000), Virginia, USA, in certain periods (which will be detailed in the following paragraphs) were used for this study. These observed data were from the USEPA STORET database, which in turn were mainly from the USGS water quality database. The data included concentrations of constituents and flows (Q) on specific dates (t). Note: some of the data was derived through simple calculations with the assumptions of TN = OrN+NH4+NO2-3, and TP = OrP+PO4.

The statistical significance of a linear regression depends on the correlation of the estimator with the regressors and on the quality of the observed data (including how representative they are of the whole period). The stations and periods selected for regression were under good sampling management by USGS, which includes regular sampling (semi-monthly or monthly) in baseflow conditions and additional sampling during stormflow conditions.

It was observed that the data after the beginning of 1986 for the Susquehanna River are more representative: they cover regular sampling (2 samples per month) representing base flow conditions, and more frequent sampling during storm flows (although the 1991 data cover fewer storm flows). Data for 1986-1990 were used for regression derivation, and the 1984-1985, 1991-1992, and 1986-1990 data were used to check the goodness of estimation in the periods 1984-1985, 1991-1992, and the regression period 1986-1990, respectively. Therefore, in this study all of the observed data in the years with good sampling design were used for regression, and it was assumed that the observed data were "true" values.

TN, TP, NO2-3, PO4, OrN, and Sed were selected for the Susquehanna River. About 238 observations from 1986-1990 (an average of 48 observations per year) were used for each constituent regression. About 33 observations from 1984-1985 were available for the comparison of model estimation in the lower extended years, while 42 observations from 1991-1992 were available in the higher extended years.

With the same reasoning for the James River, 7/88-6/92 data were selected for regression derivation, while 7/88-6/92 and 7/92-6/94 data together with 7/86-6/88 data were used to check the goodness of estimation for the three periods. As TN and OrN data in 7/88-6/92 from the James River were not sufficient for MVUE to generate reasonable values of load estimate (although ORM could), only TP, NO2-3, PO4, Sed, and TKN were selected. About 90 observations for Sed or 200 observations for other constituents in the four years from 7/88 to 6/92 were used for regressions respectively, while about 14 observations were available for the lower-extended-year comparison and about 69 observations were available for the upper-extended-year comparison.

Correlation Analysis

Correlation analysis is important for setting up regression equations. Based on a simple graphical method, this study generally agreed with previous studies (Belval et al 1995, Cohn et al 1992). Therefore, the correlation analysis will only be reviewed briefly.

The previous work (Belval et al 1995, Cohn et al 1992) showed that the correlation of concentration with flow or time is not as significant as the correlation of load with flow or of load with time. The most significant correlations are flow with time, load with flow, and load with time. If load depended merely upon flow, and flow were a function of time, then the explicit seasonal change of load with time might simply reflect the implicit correlation of flow with time. In such a case, time should not be used as a regressor when flow is already a regressor for the load estimate; otherwise collinearity would occur, leading to instability in the estimates and high standard errors. However, in addition to flow, other mechanisms or factors may cause differential responses of loads with time; therefore, time is considered as a regressor. This study agreed with the previous work (Preston et al 1989, Cohn et al 1989, 1992): the load of a constituent can be considered a function of flow as well as of time, and, consequently, the 7-parameter equation of Cohn et al. (1992) was adopted in this simulation.

Regression Derivation for Load Estimates

Setting up Regression Equations

The 7-parameter multi-variance log-linear regression equation (Cohn 1992, Belval 1995) is used:

$$\ln[C] = \beta_0 + \beta_1 \ln[Q] + \beta_2 (\ln[Q])^2 + \beta_3 \sin(2\pi T) + \beta_4 \cos(2\pi T) + \beta_5 T + \beta_6 T^2 + \varepsilon \qquad \text{(I)}$$

where: ln[ ] = natural logarithm function, C = concentration of a constituent (in mg/l), Q = the instantaneous discharge (in cubic meter), T = time in years, sin = the sine function, cos = the cosine function, $\beta_x$ = coefficients of the regression model, $\pi$ = 3.14159, $\varepsilon$ = model errors. Note: Load: L = Q * C (in kg/day, but ton/day for sediment).

Eq. I for ORM and URM regressions was modified:

  1. Using log[load] on the left-hand side of the equation:

$$\ln[L] = \beta_0 + \beta_1 \ln[Q] + \beta_2 (\ln[Q])^2 + \beta_3 \sin(2\pi T) + \beta_4 \cos(2\pi T) + \beta_5 T + \beta_6 T^2 + \varepsilon \qquad \text{(II)}$$

It would be advantageous to use log[load], instead of log[conc], on the left-hand side of the equation to emphasize the correlation of load with flow (the importance of which will be discussed later). The load estimates would have virtually no difference with log[load] or log[conc] in these two specific equations.

2. Adding a term of the reciprocal of ln[Q] to represent more variable cases of load-flow correlation:

$$\ln[L] = \beta_0 + \beta_1 \ln[Q] + \beta_2 (\ln[Q])^2 + \beta_3 / \ln[Q] + \beta_4 \sin(2\pi T) + \beta_5 \cos(2\pi T) + \beta_6 T + \beta_7 T^2 + \varepsilon \qquad \text{(III)}$$

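3. Using $Q$, $Q^2$ and $\sqrt{Q}$, instead of $\ln[Q]$, $(\ln[Q])^2$ and $1/\ln[Q]$, for the constituents with more flow-dependency, such as sediment:

$$\ln[L] = \beta_0 + \beta_1 Q + \beta_2 Q^2 + \beta_3 \sqrt{Q} + \beta_4 \sin(2\pi T) + \beta_5 \cos(2\pi T) + \beta_6 T + \beta_7 T^2 + \varepsilon \qquad \text{(IV)}$$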
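The three regressor sets above can be sketched in code. This is a minimal illustration, not the authors' implementation: the function names (`flow_terms`, `time_terms`, `regressor_row`) and the sample values of discharge and time are assumptions introduced here, and no centering or scaling of the regressors is applied because the text does not describe any.

```python
import math

def flow_terms(q, form):
    """Flow regressors for one sample with discharge q (q > 0)."""
    if form == "II":
        ln_q = math.log(q)
        return [ln_q, ln_q ** 2]
    if form == "III":
        ln_q = math.log(q)  # q must not equal 1, or 1/ln(q) is undefined
        return [ln_q, ln_q ** 2, 1.0 / ln_q]
    if form == "IV":
        return [q, q ** 2, math.sqrt(q)]
    raise ValueError(f"unknown equation form: {form!r}")

def time_terms(t_years):
    """Seasonal (sine, cosine) and trend (T, T^2) regressors; T in years."""
    angle = 2.0 * math.pi * t_years
    return [math.sin(angle), math.cos(angle), t_years, t_years ** 2]

def regressor_row(q, t_years, form="II"):
    """One row of the design matrix: intercept, flow terms, time terms."""
    return [1.0] + flow_terms(q, form) + time_terms(t_years)

# Hypothetical sample: discharge 100 (units as in the text), a quarter of a year in.
row_ii = regressor_row(100.0, 0.25, "II")    # 7 entries, Eq. II
row_iii = regressor_row(100.0, 0.25, "III")  # 8 entries, Eq. III
row_iv = regressor_row(100.0, 0.25, "IV")    # 8 entries, Eq. IV
```

Eq. II gives 7 regressors per sample (including the intercept), while Eqs. III and IV give 8, matching the coefficient counts in the equations.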

Notations for Regression Derivation

The equation set for regression can be denoted with matrix notation. For a k-parameter ($\beta_k$) regression with the estimator E (log[conc] for Eq. I or log[load] for Eq. II-IV) and k regressors $X_k$ (such as ln[Q] and others) from n samples, a set of n equations can be set up:

$$E = XB \qquad \text{(j)}$$

where

$$E = \begin{bmatrix} E_1 \\ E_2 \\ \vdots \\ E_n \end{bmatrix}, \quad X = \begin{bmatrix} 1 & \ln[Q_1] & \cdots \\ 1 & \ln[Q_2] & \cdots \\ \vdots & \vdots & \\ 1 & \ln[Q_n] & \cdots \end{bmatrix}, \quad B = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{bmatrix}$$

and each row of X holds 1, ln[Q] for that sample, and the other k-1 regressors.

The equation-set (j) can be used to estimate log[load] or log[conc] (denoted as E*, with respect to the "true" value E^) for any dates with known Q, with errors which are usually expressed as residuals (RE).
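As an illustration of solving equation-set (j) by least squares, the following sketch fits the 7-parameter form to synthetic data with NumPy. The data, the "true" coefficients, the noise level, and the helper name `fit_least_squares` are all hypothetical; they are not taken from the Susquehanna or James River records.

```python
import numpy as np

def fit_least_squares(X, E):
    """Return the coefficient vector B that minimizes ||E - X B||^2."""
    B, *_ = np.linalg.lstsq(X, E, rcond=None)
    return B

# Synthetic samples (assumed values, for illustration only).
rng = np.random.default_rng(0)
n = 200
Q = rng.uniform(50.0, 5000.0, n)   # discharge
T = rng.uniform(0.0, 5.0, n)       # time in years
lnQ = np.log(Q)

# Design matrix X with the regressors of Eq. II, one row per sample.
X = np.column_stack([
    np.ones(n), lnQ, lnQ ** 2,
    np.sin(2 * np.pi * T), np.cos(2 * np.pi * T),
    T, T ** 2,
])

B_true = np.array([2.0, 1.2, -0.03, 0.3, -0.2, 0.05, -0.01])
E = X @ B_true + rng.normal(0.0, 0.2, n)   # "observed" log[load]

B_hat = fit_least_squares(X, E)
E_star = X @ B_hat            # estimated log[load], E*
residuals = E - E_star        # RE in the text's notation
```

Because X contains an intercept column, the least-squares residuals in log space sum to zero (up to rounding), which is the sense in which the log[load] estimate carries no overall bias.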

The estimated load would be: L = exp(E) for Eqs. II-IV, or L = Q * exp(E) for Eq. I (because E = ln[C] in Eq. I and L = Q * C).

Let’s further denote the estimated loads as L*, true loads as L^, and residuals as RL, yielding L^ = L* + RL.

Bias Due to Transformation from Log-Space to Normal-Space

If the least-squares method is employed for a Best Linear Unbiased Estimate (BLUE), the curve of estimates (E*) has an overall lower difference from the corresponding true values (E^) than with other regression approaches. There would be no discrepancy between the expected value of the estimator and the population parameter being estimated. From a statistical point of view, with reference to an estimate of 5.0, an underestimate (with a true value of 5.1) has the same chance and the same magnitude as an overestimate (with a true value of 4.9). The ratio of underestimate to overestimate is: |(5.1 - 5.0) / (4.9 - 5.0)| = |0.1 / -0.1| = 1. Therefore, ORM (a BLUE method) is unbiased for the log[load] estimate. However, when the estimated log values are transformed from log space to normal space, a bias occurs. The ratio of the exponential underestimate to the exponential overestimate is: |exp(5.1) - exp(5.0)| / |exp(4.9) - exp(5.0)| = exp(0.1) ≈ 1.105 > 1.

The underestimated load would be greater than the overestimated load after the transformation for the equal amount of underestimated and overestimated log[load]. This agrees with Cohn et al. (1992) that without bias correction load is generally underestimated for the log linear regressions (such as RC or ORM). This is the reason that MVUE has been a useful estimator applied by many researchers extensively (Belval et al 1995, Wang et al 1998). The magnitude of bias depends upon the magnitude of deviation in log[load]. The lower the deviation of log(load) estimate is, the smaller the bias due to the log-to-normal transformation.
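The effect described above can be checked numerically. The sketch below is an illustration only: the function names and the chosen deviations are assumptions, and the "true" log values are placed symmetrically around a single estimate, mirroring the 5.0 / 5.1 / 4.9 example.

```python
import math

def retransformation_ratio(d, estimate=5.0):
    """Ratio of the load underestimate to the load overestimate for +/- d in log space."""
    under = math.exp(estimate + d) - math.exp(estimate)  # true load above the estimate
    over = math.exp(estimate) - math.exp(estimate - d)   # true load below the estimate
    return under / over

def naive_bias_percent(deviations, estimate=5.0):
    """Percent bias of exp(estimate) against true loads exp(estimate + deviation)."""
    true_loads = [math.exp(estimate + d) for d in deviations]
    est_total = len(true_loads) * math.exp(estimate)
    return 100.0 * (est_total - sum(true_loads)) / sum(true_loads)

ratio = retransformation_ratio(0.1)            # equals exp(0.1), independent of the estimate
small = naive_bias_percent([-0.1, 0.1])        # small symmetric log deviation
large = naive_bias_percent([-0.5, 0.5])        # larger symmetric log deviation
print(ratio, small, large)
```

Both bias values are negative (underestimation), and the larger log deviation gives the larger underestimate, consistent with the statement that the magnitude of bias depends on the magnitude of deviation in log[load].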

Regression Used in This Paper

1. MVUE

MVUE uses Eq. I and applies the unbiasing correction, L^MVUE = (L*) gm, where gm is a Bessel function with the variables of estimated variances (Cohn et al 1989, 1992).

2. ORM

ORMs used Eq. II-IV for the regression derivation, without an unbiasing correction.

A) ORM1: Using Eq II.

B) ORM2: Using Eq. IV for sediments and Eq. III for other constituents.

C) ORM3: Using Eq. II, but the time in Terms 6 and 5 is a pure decimal time of the year, denoted as t.

The equations of the ORMs are similar to that used for MVUE, except that log[load] (instead of log[conc]) is on the left-hand side of the equation. The equation of ORM2 has modified flow terms; either Eq. III or Eq. IV was used depending on the correlation between load and flow; Eq. IV is for constituents whose load is highly (exponentially) correlated with flow. The least-squares method was applied in the regression derivation.

The estimate of E*ORM (i.e., log[load]) is unbiased; however, the estimate of L*ORM (i.e., load) is biased due to the transformation from log space to normal space.

3. URM

URM used ORM equations and an unbiasing correction over ORM’s results: L^URM = (L*ORM) (U) (Y), where U is a function of Q and t, and Y is the correction coefficient to be derived by URM. The least-squares method was applied in the URM derivation.

A) URM1: The unbiasing was performed over ORM1 with the terms of U as 1/exp(1/Q) + (exp(t) + exp ( -t))/2 + ln[Q] + t for sediment and other constituents which show significant correlation of load with flow, or 1/exp(1/Q) + (exp(t) + exp(-t))/2 for other cases.

B) URM2: The unbiasing was performed over ORM2 with the terms of U as 1/exp(1/Q) + (exp(t) + exp ( -t))/2 + sin(2pT) + cos(2pT) for sediment or other constituents which show significant correlation of load with flow, or 1/exp(1/Q) + (exp(t) + exp(-t))/2 for other cases.

C) URM3: The unbiasing was performed over ORM3 with the terms of U as 1/exp(1/Q) + (exp(t) + exp ( -t))/2 + ln[Q] + t for the constituents from the James River, or ln[Q] + sin(t) + cos(t) for the constituents from the Susquehanna River.

Because the bias correction of URM directly applied load as the estimator with least-squares regression derivation, the load estimate (L*URM) is unbiased.
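The text does not spell out how the terms of U and the coefficient Y are combined, so the following is only one possible reading, offered as a hedged sketch: each listed term of U receives its own coefficient, and the coefficients Y are found by least squares with load itself (not log[load]) as the estimator, so that L*ORM multiplied by the weighted sum of U terms approximates the observed load. The function names, the synthetic data, and this interpretation are assumptions, not the authors' documented procedure.

```python
import numpy as np

def urm_terms(Q, t):
    """Terms listed for URM1 (flow-correlated case): 1/exp(1/Q), cosh(t), ln[Q], t."""
    return np.column_stack([
        1.0 / np.exp(1.0 / Q),
        (np.exp(t) + np.exp(-t)) / 2.0,
        np.log(Q),
        t,
    ])

def fit_urm(L_orm, U, L_obs):
    """Least-squares coefficients Y such that L_orm * (U @ Y) approximates L_obs."""
    A = L_orm[:, None] * U   # each column is L*ORM times one U term
    Y, *_ = np.linalg.lstsq(A, L_obs, rcond=None)
    return Y

def apply_urm(L_orm, U, Y):
    """Corrected load under the assumed form L_orm * (U @ Y)."""
    return L_orm * (U @ Y)

# Synthetic illustration (assumed values, not river data).
rng = np.random.default_rng(1)
n = 150
Q = rng.uniform(50.0, 5000.0, n)
t = rng.uniform(0.0, 1.0, n)                       # decimal time of the year
L_orm = 0.5 * Q ** 1.2                             # stand-in for back-transformed ORM loads
L_obs = L_orm * np.exp(rng.normal(0.0, 0.3, n))    # stand-in for observed loads

U = urm_terms(Q, t)
Y = fit_urm(L_orm, U, L_obs)
L_urm = apply_urm(L_orm, U, Y)
```

Under this reading the fit is made directly in load space, which is the property the text relies on when it says the URM load estimate is unbiased; the sketch makes no claim about reproducing the bias values reported in the tables.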

Load Estimate

The loads or concentrations of PO4, NO2-3, TP, OrN, TN, and Sed for the Susquehanna River, and PO4, NOx, TP, TKN and Sed for the James River, together with flow and time, were input into the above equations, using 1986-1990 data for the Susquehanna River and 7/88-6/92 data for the James River to derive the regressions, producing outputs of estimated daily loads in 1984-1992 for the Susquehanna River and 7/86-6/94 for the James River.

Results and Discussions

Introduction of Output Tables from Regression Results

The results of the regression for the selected constituents by each regression method are analyzed by the comparison with observed loads. Residuals (i.e., simulation - observation) are the major parameters. Therefore, only those days with observed data will be considered, and the set of all associated residuals is considered as the whole sample and assumed to represent the whole population.

The whole period is divided into three sub-periods, the regression period, the lower extended period, and the upper extended period, as defined in the section Selection of Observed Data. Due to limited space of publication, only three parameters are listed for each sub-period: 1) SR: summation of residuals, $SR = \sum (est - obs)$; 2) Bias, $B = 100 \, [SR / \sum obs]$; and 3) SE: standard error of estimate, $SE = \sqrt{\sum (est - obs)^2 / (n - 2)}$, where obs is the observed value, est is the estimated value, and n is the number of samples. The results are compared only for those days with observed data, which are regarded as representative of the corresponding sub-periods.

SR represents the difference between the sum of the estimated values and the sum of the observed values, and has a similar meaning to the bias. Based on the sign convention of this paper, a negative value of SR means an overall underestimate. An SR can be very small when both underestimates and overestimates are very large but nearly cancel. Therefore, a low SR does not guarantee that a regression is a good estimator. The standard error of estimate (SE), in contrast, measures the spread of the sample points about the regression line. The multiple coefficient of determination, R^2, is also a useful statistical measure for multiple regression; it indicates how well the multiple regression equation fits the available sample data. Both SE and R^2 can be used to test the goodness of a regression with respect to "true" values. This discussion will be primarily based on the SE values. Standard deviation (SD), which involves deviations from the mean of a set of samples or estimates, is also a useful statistical measure, but will not be discussed in this paper except when citing Cohn et al. (1992).
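The three statistics (SR, B, and SE) are simple to compute. The sketch below uses hypothetical observed and estimated values, chosen only to show how an estimator with offsetting errors can have SR = 0 while its SE is larger than that of an estimator with a small consistent underestimate; the function names are assumptions.

```python
import math

def summation_of_residuals(est, obs):
    """SR: sum of (estimate - observation)."""
    return sum(e - o for e, o in zip(est, obs))

def bias_percent(est, obs):
    """B: 100 * SR / sum of observations."""
    return 100.0 * summation_of_residuals(est, obs) / sum(obs)

def standard_error(est, obs):
    """SE: square root of the sum of squared residuals divided by (n - 2)."""
    n = len(obs)
    if n <= 2:
        raise ValueError("SE needs more than two samples")
    sse = sum((e - o) ** 2 for e, o in zip(est, obs))
    return math.sqrt(sse / (n - 2))

# Hypothetical loads (not from Table 1 or Table 2).
obs = [10.0, 20.0, 30.0, 40.0]
est_a = [12.0, 18.0, 33.0, 37.0]   # large errors that cancel
est_b = [9.0, 19.0, 29.0, 39.0]    # small, consistent underestimate

for name, est in (("A", est_a), ("B", est_b)):
    print(name, summation_of_residuals(est, obs),
          bias_percent(est, obs), standard_error(est, obs))
```

Estimator A has SR = 0 and SE = sqrt(13), while estimator B has SR = -4 (B = -4%) and SE = sqrt(2), so the estimator with the "better" SR is the worse one day by day.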

Comparison of Bias

Cohn et al. (1992) compared bias of MVUE with other methods. The values of bias of MVUE for the Susquehanna River in this calculation (Table 1) are similar to (but less than) those in Cohn et al. (1992). The differences may be due to 1) different time periods were used for regression, and 2) Cohn et al. statistics were based on the average B of subsamples, while this paper was based on all available observed samples in a period. A similar conclusion to Cohn et al. (1992) was derived from our calculations, in that in the regression period MVUE has lower bias than other regressions without bias corrections. ORMs are usually underestimated due to Type-2 bias. These calculations further show that the URM has virtually no bias, <0.01%, in the regression period, while MVUE still has considerable calculated bias (most are positive bias).

Cohn et al. (1992) did not simulate extended periods. It is conceivable that, for most methods, biases are more significant in the extended periods than in the regression period. Tables 1 and 2 show that in the extended period, URMs have more or nearly equal numbers of cases with lower B values compared with ORMs, and ORMs generally have more cases with lower B than MVUE.

SE versus SR as Evaluating Parameters

This section will use an example to show that a method with lower total SR may have more chances of higher SR in split yearly or daily loads if SE is large.

Let’s pick out one sample set with lower SE but much higher SR in ORM (ORM3) than in MVUE, such as TP from the James River. Table 2 also lists yearly SR for TP. In the regression period, SE is 9130 and 9549 for ORM3 and MVUE, respectively, and SR is -307341 and 49035 for ORM3 and MVUE, respectively. For the total load during the entire regression period (7/88-6/92), MVUE may have a closer estimate because of its lower absolute SR. However, if the period is split into four years, the situation is different. ORM3 has lower SR in two years (7/88-6/89 and 7/90-6/91), while MVUE has lower SR in the other two years (7/89-6/90 and 7/91-6/92). This means that although MVUE has lower SR in the entire regression period, when the period is subdivided into years ORM3 and MVUE have equal chances of having the lower yearly SR. If the period is subdivided into smaller steps (monthly or daily), ORM3 may have more chances of lower SR than MVUE, as long as ORM3 has lower SE than MVUE. Moreover, ORM3 is lower in both SE and SR than MVUE for TP from the James River in the extended years; consequently, ORM3 has lower SRs in all four of the split extended years. For other parameters where both SR and SE are lower in URM or ORM than in MVUE, the chances of lower yearly or monthly SR would likewise be higher in the former. A higher chance of lower yearly SR does not by itself mean that the corresponding method is better. The above analysis shows that SE is an important parameter for testing the goodness of a regression method, especially for daily estimates.
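Splitting SR by year, as in the TP example, amounts to grouping residuals by 12-month period. A minimal sketch follows, assuming July-to-June years as used for the James River; the function names and the three sample records are hypothetical and are not values from Table 2.

```python
from collections import defaultdict
from datetime import date

def split_year_start(d, start_month=7):
    """Starting calendar year of the 12-month period (e.g. 7/88-6/89 -> 1988) containing d."""
    return d.year if d.month >= start_month else d.year - 1

def yearly_sr(records, start_month=7):
    """Sum of (estimate - observation) per 12-month period.

    records: iterable of (date, estimated load, observed load).
    """
    totals = defaultdict(float)
    for d, est, obs in records:
        totals[split_year_start(d, start_month)] += est - obs
    return dict(totals)

# Hypothetical records for illustration only.
records = [
    (date(1988, 7, 15), 12.0, 10.0),   # falls in 7/88-6/89
    (date(1989, 6, 30), 8.0, 10.0),    # falls in 7/88-6/89
    (date(1989, 7, 1), 15.0, 10.0),    # falls in 7/89-6/90
]
sr_by_year = yearly_sr(records)
total_sr = sum(sr_by_year.values())
```

The same grouping with a monthly or daily key would give the finer splits discussed in the text.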

Comparison of the Results From MVUE and ORMs

· In the regression periods: MVUE has more chances of lower SE than ORM1, while maintaining nearly equal chances of lower SE to ORM2 and ORM3. MVUE usually has lower SR than ORM1, ORM2 and ORM3.

· In the extended periods: ORM1, ORM2 and ORM3 generally have lower SE than MVUE, and usually have lower SR than MVUE.

ORMs are usually fairly good in load estimates in the extended periods. A good estimation in extended periods may be desirable in many cases. For example, there are sufficient data for storm flow and base flows in 1986-1995, however, not for storm flows and base flows in 1984-85 and 1996. When using a regression method to estimate loads during 1984-1996, it may be better not to include the 1984-85 and 1996 data in the regression derivation, while 1984-85 and 1996 loads will be estimated from the regression equation established by 1986-1995 data. In such cases, ORM would be preferred. Similarly, a regression good in the extended period would enable users to use the established regression equations in the publication (which were based on detailed study with sound samples in a certain period) for their work.

Nevertheless, it should be noted that ORMs generally underestimate overall loads because of the downward bias introduced by the transformation from log[load] space to load space. From the discussion of bias generation it is understood that if log[load] can be closely estimated, then the bias due to the transformation may be less significant.

Comparing URM with ORM and MVUE

· In the regression period: SRs of URM are always close to zero, indicating URM is a virtually unbiased estimator; while SRs of MVUE are still high, yet generally lower than those of ORM.

SEs of URMs are lower than those of their ORM counterparts, indicating that URM improves the regression over ORM after the unbiasing correction. Even though ORM1 for some constituents (e.g., sediment of the Susquehanna River, Table 1) has a higher SE than MVUE in the regression period, its corresponding URM1 usually has lower SE than MVUE.

· In the extended period: URM and ORM have nearly equal chances of lower SE and SR values, and have more chances of lower SE and SR than MVUE.

The above demonstrates that the newly developed unbiasing methods (URMs) not only reduce SR, but also reduce SE with respect to ORMs. Furthermore, the SE and SR of URM are usually lower than those of MVUE. Therefore URM is regarded as a better method. However, URM is sensitive to the correlation of constituent loads with flow and time. Correlation analysis is important in selecting equations; otherwise unrealistic values may be generated in some cases.

Correlation Analysis Is Important in Regression Model

For a regression equation to be significant, the estimator should be significantly correlated with the regressors. The correlation analysis from the earlier section indicates that log[load] is significantly correlated with log[flow], while log[conc] is not. Therefore, the use of log[load] instead of log[conc] as the estimator is recommended. However, the regressions with the estimator of either log[conc] or log[load] in Eqs. I and II generate the same results. This does not mean that the correlation is unimportant. In these specific equations, log[conc] on the left-hand side can be expressed as log[load] - log[flow], and the regressors on the right-hand side contain a log[flow] term; therefore, Eq. I is equivalent to Eq. II, with the one additional unit of log[flow] absorbed into coefficient ß1, and the load estimates from the two equations do not differ. If one removes log[flow] but keeps the other flow terms (such as (log[Q])^2 and/or √Q and/or 1/Q), then the load estimates would differ between the estimators log[conc] and log[load]. This demonstrates the importance of correlation (of load, instead of concentration, with flow) in regression.
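The equivalence of Eqs. I and II can be illustrated with a reduced two-parameter version (intercept and ln[Q] only), fitted by ordinary least squares. The data and the helper `simple_ols` are hypothetical, and the unit-conversion factor between Q * C and load is taken as 1 for simplicity; with the full regressor set the same argument applies because ln[Q] remains among the regressors.

```python
import math

def simple_ols(x, y):
    """Intercept and slope of the least-squares line y = a + b x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Hypothetical flows and concentrations (not river data).
Q = [120.0, 450.0, 900.0, 2300.0, 5100.0]
C = [1.8, 1.5, 1.4, 1.1, 0.9]

ln_q = [math.log(q) for q in Q]
ln_c = [math.log(c) for c in C]
ln_l = [a + b for a, b in zip(ln_q, ln_c)]   # ln[L] = ln[Q] + ln[C]

a_conc, b_conc = simple_ols(ln_q, ln_c)      # log[conc] as estimator (Eq. I style)
a_load, b_load = simple_ols(ln_q, ln_l)      # log[load] as estimator (Eq. II style)

loads_from_conc = [q * math.exp(a_conc + b_conc * x) for q, x in zip(Q, ln_q)]
loads_from_load = [math.exp(a_load + b_load * x) for x in ln_q]
```

The two fits share the same intercept, the ln[Q] coefficient differs by exactly one, and the resulting load estimates coincide.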

The correlation analysis showed that the time terms may not affect load as significantly as flow does. ORM and URM were also applied to regressions without the time regressors. Some results were close to those produced by the regression with all regressors, while some were not. This may reflect the importance of the load-time correlation. Cohn’s 7-parameter equation gives good consideration to the variation of concentration with time.

Usually there is no unique regression equation for different constituents. Load-flow correlation can be quite different among constituents, depending on the constituents' behaviors and the overall characteristics of the watershed. We agree with Preston et al. (1989) that the choice of an approach for estimation should be based on the nature and characteristics of the data that will be utilized. It is recommended to study the correlation between the estimator and the candidate regressors when developing a regression model.

Recommended Method: URM, or a Combination of URM with ORM and MVUE

Because URM generally has lower SE and SR than ORM or MVUE, URM is considered to be a better method. However, because the unbiasing is based on observed data, URM may generate greater deviations than ORM in the extended periods in some cases. Therefore, a certain combination of URM with ORM and MVUE may be applied.

This study is a preliminary one. A study on the applications of regression methods to various data may be useful in further evaluation of the methods, and could provide suggestions to develop better regression methods for load estimate.

Conclusions

This study presents a new regression approach for estimating fluvial loads based on the correlation of load with flow and time. The URM is a virtually unbiased estimator with low SE and SR, improving on ORMs and MVUE in some aspects. Because the quality and representativeness of samples, and the correlation of loads or concentrations with stream factors, differ among streams, there is no unique form of regression that is better than the others in estimating constituent loads, especially if some of the effective stream factors are not (or cannot be) fully simulated. Therefore, comparing a few methods and choosing a suitable one may be practical for more accurate estimates. Nevertheless, the combination of URM with ORM and MVUE is generally recommended.

References

Belval, D.L., P.J. Campbell, S.W. Phillips, and C.F. Bell. 1995. Water quality characteristics of five tributaries to the Chesapeake Bay at the fall line, Virginia, July 1988 through June 1993. USGS Water-Resources Investigations Report 95-4258, USGS, Richmond, Virginia, 71pp.

Bradu, D., and Y. Mundlak. 1970. Estimation in lognormal linear models. J. Am. Stat. Assoc. 65(329): 198-211.

Cohn, T.A., D. L. Caulder, E.J. Gilroy, L.D. Zynjuk, and R.M. Summers. 1992. The validity of a simple statistical model for estimating fluvial constituent loads: an empirical study involving nutrient loads entering Chesapeake Bay. Water Resour. Res. 28(9): 2353-2363.

Cohn, T.A., L. L. DeLong, E.J. Gilroy, R.M. Hirsch, and D.K. Wells. 1989. Estimating constituent load. Water Resour. Res., 25(5): 937-942.

Duan, N. 1983. Smearing estimate: A nonparametric retransformation method. J. Am. Stat. Assoc. 78(383): 605-610.

Gilbert, R.O. 1987. Statistical methods for environmental pollution monitoring. Van Nostrand Reinhold Co., NY. 313pp.

Gilroy E.J., R.M. Hirsch, and T.A. Cohn. 1990. Mean square error of regression-based constituent transport estimates. Water Resour. Res. 26(9): 2069-2077.

Grewal, P.S. 1990. Methods of Statistical Analysis. Sterling Publishers, New York, New York, 1304pp.

Miller, C.R. 1951. Analysis of flow-duration, sediment-rating curve method of computing sediment yield, report. U.S. Bur. of Reclam., Denver, Colo. 15 pp.

Preston, S.D., V.J. Bierman, Jr., and S.E. Silliman. 1989. An evaluation of methods for the estimation of tributary mass loads. Water Resour. Res. 25(6): 1379-1389.

Walling, D.E. 1977. Assessing the accuracy of suspended sediment rating curves for a small basin. Water Resour. Res. 13(3): 531-538.

Wang, P., L.C. Linker, and J. Storrick. 1998. Chesapeake Bay Watershed Model Application & Calculation of Nutrient & Sediment Loadings, Appendix G: Observed data used for calibration, a regression model, and a confirmation scenario of Phase IV Watershed Model. EPA/CBPO document (in preparation; to be printed in 8/98).

Thomas, R.B. 1988. Monitoring baseline suspended sediment in forested basins: The effects of sampling on suspended sediment rating curves. Hydrol. Sci. J. 33(5): 499-514.