The first statistical problem we posed in this book, in connection with Figure 1-2A, dealt with a drug that was thought to be a diuretic, but that experiment cannot be analyzed using our existing procedures. In it, we selected different people and gave them different doses of the diuretic, then measured their urine output. The people who received larger doses produced more urine. The statistical question is whether the resulting pattern of points relating urine production to drug dose provided sufficient evidence to conclude that the drug increased urine production in proportion to drug dose. This chapter develops the tools for analyzing such experiments. We will estimate how much one variable increases (or decreases), on the average, as another variable changes by fitting a regression line, and we will quantify the strength of the association with a correlation coefficient.*
* Simple linear regression is a special case of the more general method of multiple regression in which case there are multiple independent variables. For a discussion of multiple regression and related procedures written in the same style as this book, see Glantz SA, Slinker BK. Primer of Applied Regression and Analysis of Variance, 2nd ed. New York: McGraw-Hill; 2001.
As in all other statistical procedures, we want to use a sample drawn at random from a population to make statements about the population. Chapters 3 and 4 discussed populations whose members are normally distributed with mean μ and standard deviation σ and used estimates of these parameters to design test statistics (such as F and t) that permitted us to examine whether or not some discrete treatment was likely to have affected the mean value of a variable of interest. Now, we add another parametric procedure, linear regression, to analyze experiments in which the samples were drawn from populations characterized by a mean response that varies continuously with the size of the treatment. To understand the nature of this population and the associated random samples, we return again to Mars, where we can examine the entire population of 200 Martians.
Figure 2-1 showed that the heights of Martians are normally distributed with a mean of 40 cm and a standard deviation of 5 cm. In addition to measuring the height of each Martian, let us also weigh each one. Figure 8-1 shows a plot in which each point represents the height x and weight y of one Martian. Since we have observed the entire population, there is no question that tall Martians tend to be heavier than short Martians.
Figure 8-1.
The relationship between height and weight in the population of 200 Martians, with each Martian represented by a circle. The weights at any given height follow a normal distribution. In addition, the mean weight of Martians at any given height increases linearly with height, and the variability in weight at any given height is the same regardless of height. A population must have these characteristics to be suitable for linear regression or correlation analysis.
There are a number of things we can conclude about the heights and weights of Martians as well as the relationship between these two variables. As noted in Chapter 2, the heights are normally distributed with mean μ = 40 cm and standard deviation σ = 5 cm. The weights are also normally distributed with mean μ = 12 g and standard deviation σ = 2.5 g. The most striking feature of Figure 8-1, however, is that the mean weight of Martians at each height increases as height increases.
For example, the Martians who are 32 cm tall weigh 7.1, 7.9, 8.3, and 8.8 g, so the mean weight of Martians who are 32 cm tall is 8 g. The eight Martians who are 46 cm tall weigh 13.7, 14.5, 14.8, 15.0, 15.1, 15.2, 15.3, and 15.8 g, so the mean weight of Martians who are 46 cm tall is 15 g. Figure 8-2 shows that the mean weight of Martians at each height increases linearly as height increases.
This line does not make it possible, however, to predict the weight of an individual Martian if you know his or her height. Why not? There is variability in weights among Martians at each height. Figure 8-1 reveals that the standard deviation of weights of Martians at any given height is about 1 g. We need to distinguish this standard deviation from the standard deviation of weights of all Martians computed without regard for the fact that mean weight varies with height.
Now, let us define some new terms and symbols so that we can generalize from Martians to other populations with similar characteristics. Since we are considering how weight varies with height, call height the independent variable x and weight the dependent variable y. In some instances, including the example at hand, we can only observe the independent variable and use it to predict the expected mean value of the dependent variable. (There is variability in the dependent variable at each value of the independent variable.) In other cases, including controlled experiments, it is possible to manipulate the independent variable to control, with some uncertainty, the value of the dependent variable. In the first case, it is only possible to identify an association between the two variables, whereas in the second case it is possible to conclude that there is a causal link.*
For any given value of the independent variable x, it is possible to compute the value of the mean of all values of the dependent variable corresponding to that value of x. We denote this mean μy·x to indicate that it is the mean of all the values of y in the population at a given value of x. These means fall along a straight line given by

μy·x = α + βx
in which α is the intercept and β is the slope† of the line of means. For example, Figure 8-2 shows that, on the average, the weight of Martians increases by 0.5 g for every 1 cm increase in height, so the slope β of the μy·x versus x line is 0.5 g/cm. The intercept α of this line is −8 g. Hence,

μy·x = −8 + 0.5x
There is variability about the line of means. For any given value of the independent variable x, the values of y for the population are normally distributed with mean μy·x and standard deviation σy·x. This notation indicates that σy·x is the standard deviation of weights (y) computed after allowing for the fact that mean weight varies with height (x). As noted above, the residual variation about the line of means for our Martians is 1 g; σy·x =1 g. The amount of this variability is an important factor in determining how useful the line of means is for predicting the value of the dependent variable, for example, weight, when you know the value of the independent variable, for example, height. The methods we develop below require that this standard deviation be the same for all values of x. In other words, the variability of the dependent variable about the line of means is the same regardless of the value of the independent variable.
In summary, we will be analyzing the results of experiments in which the observations were drawn from populations with these characteristics:
- The mean of the population of the dependent variable at a given value of the independent variable increases (or decreases) linearly as the independent variable increases.
- For any given value of the independent variable, the possible values of the dependent variable are distributed normally.
- The standard deviation of population of the dependent variable about its mean at any given value of the independent variable is the same for all values of the independent variable.
The parameters of this population are α and β, which define the line of means, the dependent-variable population mean at each value of the independent variable, and σy·x, which defines the variability about the line of means.
Now let us turn our attention to the problem of estimating these parameters from samples drawn at random from such populations.
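To make these assumptions concrete, the short Python sketch below simulates drawing a random sample from a population built exactly this way, using the Martian values α = −8 g, β = 0.5 g/cm, and σy·x = 1 g for the line of means and the scatter about it. The function name and the use of Python are illustrative only, not part of the original presentation.

```python
import random

# Population parameters for the Martian example
ALPHA = -8.0    # intercept of the line of means, g
BETA = 0.5      # slope of the line of means, g/cm
SIGMA_YX = 1.0  # standard deviation about the line of means, g (same at every height)

def sample_martians(n, seed=0):
    """Draw n (height, weight) pairs from a population that satisfies the three
    assumptions: linear line of means, normal scatter, constant standard deviation."""
    rng = random.Random(seed)
    sample = []
    for _ in range(n):
        height = rng.gauss(40.0, 5.0)              # heights are normal, mean 40 cm, SD 5 cm
        mean_weight = ALPHA + BETA * height        # the line of means at this height
        weight = rng.gauss(mean_weight, SIGMA_YX)  # normal scatter about the line of means
        sample.append((height, weight))
    return sample

if __name__ == "__main__":
    for height, weight in sample_martians(10):
        print(f"height = {height:5.1f} cm, weight = {weight:5.1f} g")
```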
* In an observational study, statistical analysis alone only permits identification of an association. In order to identify a causal relationship, one generally requires independent evidence to explain the biological (or other) mechanisms that give rise to the observed association. For example, the fact that several epidemiological studies demonstrated an association between passive smoking and heart disease combined with laboratory studies showing short-term effects of secondhand smoke and secondhand smoke constituents on the heart, led to the conclusion that passive smoking causes heart disease. For details on how a variety of such evidence is combined to use observational studies as part of the case for a causal relationship, see Glantz SA, Parmley WW. Passive smoking and heart disease: epidemiology, physiology, and biochemistry. Circulation. 1991;83:1–12 and Barnoya J, Glantz S. Cardiovascular effects of secondhand smoke: nearly as large as smoking. Circulation 2005;111:2684–2698.
† It is, unfortunately, statistical convention to use α and β in this way even though the same two Greek letters also denote the size of the Type I and Type II errors in hypothesis testing. The meaning of α should be clear from the context. β always refers to the slope of the line of means in this chapter.
Since we observed the entire population of Mars, there was no uncertainty about how weight varied with height. This situation contrasts with real problems, in which we cannot observe all members of a population and must infer things about it from a limited sample that we hope is representative. To understand the information that such samples contain, let us consider a sample of 10 individuals selected at random from the population of 200 Martians. Figure 8-3A shows the members of the population that happened to be selected; Figure 8-3B shows what an investigator or reader would see. What do the data in Figure 8-3B allow you to say about the underlying population? How certain can you be about the resulting statements?
Simply looking at Figure 8-3B reveals that weight increases as height increases among the 10 specific individuals in this sample. The real question of interest, however, is: Does weight vary with height in the population the sample came from? After all, there is always a chance that we could draw an unrepresentative sample, just as in Figure 1-2. Before we can test the hypothesis that the apparent trend in the data is due to chance rather than a true trend in the population, we need to estimate the population trend from the sample. This task boils down to estimating the intercept α and slope β of the line of means.
We will estimate the two population parameters α and β with the intercept and slope, a and b, of a straight line placed through the sample points. Figure 8-4 shows the same sample as Figure 8-3B with four proposed lines, labeled I, II, III and IV. Line I is obviously not appropriate; it does not even pass through the data. Line II passes through the data but has a much steeper slope than the data suggest is really the case. Lines III and IV seem more reasonable; they both pass along the cloud defined by the data points. Which one is best?
Figure 8-4.
Four different possible lines to estimate the line of means from the sample in Figure 8-3. Lines I and II are unlikely candidates because they fall so far from most of the observations. Lines III and IV are more promising.
To select the best line and so get our estimates a and b of α and β, we need to define precisely what “best” means. To arrive at such a definition, first think about why line II seems better than line I and line III seems better than line II. The “better” a straight line is, the closer it comes to all the points taken as a group. In other words, we want to select the line that minimizes the total variability between the data and the line. The farther any one point is from the line, the more the line varies from the data, so let us select the line that leads to the smallest total variability between the observed values and the values predicted from the straight line.
The problem becomes one of defining a measure of variability and then selecting values of a and b to minimize this quantity. Recall that we quantified variability in a population with the variance (or standard deviation) by computing the sum of the squared deviations from the mean and then dividing by the sample size, n, minus 1. Now we will use the same idea and use the sum of the squared differences between the observed values of the dependent variable and the value on the line at the same value of the independent variable as our measure of how much any given line varies from the data. We square the deviations so that positive and negative deviations contribute equally. Figure 8-5 shows the deviations associated with lines III and IV in Figure 8-4. The sum of squared deviations is smaller for line IV than line III, so it is the better line. In fact, it is possible to prove mathematically that line IV is the one with the smallest sum of squared deviations between the observations and the line,* making it the “best” line. For this reason, this procedure is often called the method of least squares or least-squares regression.
Figure 8-5.
Lines III (A) and IV (B) in Figure 8-4, together with the deviations between the lines and the observations. Line IV is associated with the smallest sum of squared deviations between the regression line and the observed values of the dependent variable. The vertical lines indicate the deviations. The black line is the line of means for the population of Martians in Figure 8-1. The regression line approximates the line of means but does not precisely coincide with it. Line III is associated with larger deviations than line IV.
The resulting line is called the regression line of y on x (in this case, the regression line of weight on height). Its equation is

ŷ = a + bx

ŷ denotes the value of y on the regression line for a given value of x. This notation distinguishes it from the observed value of the dependent variable Y. The intercept a is given by

a = (ΣX²ΣY − ΣXΣXY)/[nΣX² − (ΣX)²]

and the slope is given by

b = (nΣXY − ΣXΣY)/[nΣX² − (ΣX)²]

in which X and Y are the coordinates of the n points in the sample.*
Table 8-1 shows these computations for the sample of 10 points in Figure 8-3B. From this table, n = 10, ΣX = 369 cm, ΣY = 103.8 g, ΣX² = 13,841 cm², and ΣXY = 3930.1 g · cm. Substitute these values into the equations for the intercept and slope of the regression line to find

a = [13,841(103.8) − 369(3930.1)]/[10(13,841) − 369²] = −6.0 g

and

b = [10(3930.1) − 369(103.8)]/[10(13,841) − 369²] = 0.44 g/cm
Observed Height X (cm) | Observed Weight Y (g) | X² (cm²) | XY (g · cm) |
---|---|---|---|
31 | 7.8 | 961 | 241.8 |
32 | 8.3 | 1,024 | 265.6 |
33 | 7.6 | 1,089 | 250.8 |
34 | 9.1 | 1,156 | 309.4 |
35 | 9.6 | 1,225 | 336.0 |
35 | 9.8 | 1,225 | 343.0 |
40 | 11.8 | 1,600 | 472.0 |
41 | 12.1 | 1,681 | 496.1 |
42 | 14.7 | 1,764 | 617.4 |
46 | 13.0 | 2,116 | 598.0 |
369 | 103.8 | 13,841 | 3,930.1 |
Line IV in Figures 8-4 and 8-5B is this regression line.
These two values are estimates of the population parameters, α = −8 g and β = 0.5 g/cm, the intercept and slope of the line of means. The light line in Figure 8-5B shows the line of means.
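The arithmetic summarized in Table 8-1 is easy to reproduce by machine. The following Python sketch (the variable names are illustrative) applies the least-squares formulas above to the 10 observations and recovers the same estimates, roughly a = −6.0 g and b = 0.44 g/cm.

```python
# Heights (cm) and weights (g) of the 10 Martians in Table 8-1
X = [31, 32, 33, 34, 35, 35, 40, 41, 42, 46]
Y = [7.8, 8.3, 7.6, 9.1, 9.6, 9.8, 11.8, 12.1, 14.7, 13.0]

n = len(X)
sum_x = sum(X)                               # 369
sum_y = sum(Y)                               # 103.8
sum_x2 = sum(x * x for x in X)               # 13,841
sum_xy = sum(x * y for x, y in zip(X, Y))    # 3,930.1

# Least-squares intercept and slope of the regression of weight on height
a = (sum_x2 * sum_y - sum_x * sum_xy) / (n * sum_x2 - sum_x ** 2)
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)

print(f"intercept a = {a:.1f} g")   # about -6.0 g
print(f"slope b = {b:.2f} g/cm")    # about 0.44 g/cm
```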
* For this proof and a derivation of the formulas for the slope and intercept of this line, see Glantz SA, Slinker BK. Primer of Applied Regression and Analysis of Variance, 2nd ed. New York: McGraw-Hill; 2001:19.
† The calculations can be simplified by computing b first, then finding a from a = Ȳ − bX̄, in which X̄ and Ȳ are the means of all observations of the independent and dependent variables, respectively.
We have the regression line to estimate the line of means, but we still need to estimate the variability of population members about the line of means, σy·x. We estimate this parameter by computing the square root of the “average” squared deviation of the data about the regression line

sy·x = √{Σ[Y − (a + bX)]²/(n − 2)}
where a + bX is the value ŷ on the regression line corresponding to the observation at X; Y is the actual observed value of y; Y − (a + bX) is the amount that the observation deviates about the regression line and Σ denotes the sum, over all the data points, of the squares of these deviations [Y − (a + bX)]2. We divide by n − 2 rather than n for reasons analogous to dividing by n − 1 when computing the sample standard deviation as an estimate of the population standard deviation. Since the sample will not show as much variability as the population, we need to decrease the denominator when computing the “average” squared deviation from the line to compensate for this tendency to underestimate the population variability.
sy·x is called the standard error of the estimate. It is related to the standard deviations of the dependent and independent variables and the slope of the regression line according to

sy·x = √{[(n − 1)/(n − 2)](sY² − b²sX²)}
where sY and sX are the standard deviations of the dependent and independent variables, respectively.
For the sample shown in Figure 8-3B (and Table 8-1), sX = 5.0 cm and sY = 2.4 g, so

sy·x = √{[(10 − 1)/(10 − 2)](2.4² − 0.44² · 5.0²)} = 1.0 g
This number is an estimate of the actual variability about the line of means, σy·x = 1 g.
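Continuing the same example, the sketch below computes the standard error of the estimate in both ways, directly from the residuals about the regression line and with the shortcut based on sY, sX, and b; both come out to about 1 g. The rounded coefficients a = −6.0 and b = 0.44 are taken from above, and the variable names are again illustrative.

```python
from math import sqrt

X = [31, 32, 33, 34, 35, 35, 40, 41, 42, 46]
Y = [7.8, 8.3, 7.6, 9.1, 9.6, 9.8, 11.8, 12.1, 14.7, 13.0]
n = len(X)
a, b = -6.0, 0.44   # regression coefficients estimated above (rounded)

# Directly from the residuals about the regression line
ss_residual = sum((y - (a + b * x)) ** 2 for x, y in zip(X, Y))
s_yx_direct = sqrt(ss_residual / (n - 2))

# Shortcut using the sample standard deviations of X and Y and the slope
mean_x, mean_y = sum(X) / n, sum(Y) / n
s_x = sqrt(sum((x - mean_x) ** 2 for x in X) / (n - 1))
s_y = sqrt(sum((y - mean_y) ** 2 for y in Y) / (n - 1))
s_yx_shortcut = sqrt((n - 1) / (n - 2) * (s_y ** 2 - b ** 2 * s_x ** 2))

print(f"s_yx from residuals: {s_yx_direct:.2f} g")   # about 1 g
print(f"s_yx from shortcut:  {s_yx_shortcut:.2f} g") # about 1 g
```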
Just as the sample mean is only an estimate of the true population mean, the slope and intercept of the regression line are only estimates of the slope and intercept of the line of means in the population. In addition, just as different samples yield different estimates for the population mean, different samples will yield different regression lines. After all, there is nothing special about the sample in Figure 8-3. Figure 8-6A shows another sample of 10 individuals drawn at random from the population of all Martians. Figure 8-6B shows what you would see. Like the sample in Figure 8-3B, the results of this sample also suggest that taller Martians tend to be heavier, but the relationship looks a little different from that associated with our first sample. This sample yields a = −4.0 g and b = 0.38 g/cm as estimates of the intercept and slope of the line of means.
Figure 8-6.
This figure illustrates a second random sample of 10 Martians drawn from the population in Figure 8-1. This sample is associated with a different regression line than that computed from the first sample, shown in Figure 8-5A.
There is a population of possible values of a and b corresponding to all possible samples of a given size drawn from the population in Figure 8-1. These distributions of all possible values of a and b have means α and β, respectively, and standard deviations σa and σb called the standard error of the intercept and standard error of the slope, respectively.
These standard errors can be used just as we used the standard error of the mean and standard error of a proportion. Specifically, we will use them to test hypotheses about, and compute confidence intervals for, the regression coefficients and the regression equation itself.
The standard deviation of the population of all possible values of the regression line intercept, the standard error of the intercept, can be estimated from the sample with*

sa = sy·x √{1/n + X̄²/[(n − 1)sX²]}
The standard error of the slope of the regression line is the standard deviation of the population of all possible slopes. Its estimate is

sb = sy·x/[√(n − 1) sX]
From the data in Figure 8-3B and Table 8-1 it is possible to compute the standard errors for the slope and intercept as

sb = 0.064 g/cm

and

sa = 2.6 g
Like the sample mean, both a and b are computed from sums of the observations. Like the distributions of all possible values of the sample mean, the distributions of all possible values of a and b tend to be normally distributed. (This result is another consequence of the central-limit theorem.) The specific values of a and b associated with the regression line are then randomly selected from normally distributed populations. Therefore, these standard errors can be used to compute confidence intervals and test hypotheses about the intercept and slope of the line of means using the t distribution, just as we did for the sample mean in Chapter 7.
* For a derivation of these formulas, see Neter J, Kutner MH, Nachtsheim CJ, Wasserman W. Inferences in regression analysis. Applied Linear Statistical Models: Regression, Analysis of Variance, and Experimental Designs. Boston: WCB McGraw-Hill; 1996:chap 2.
There are many hypotheses we can test about regression lines, but the most common and important one is that the slope of the line of means is zero. Testing this hypothesis is equivalent to estimating the chance that we would observe a trend as strong as or stronger than the data show when there is actually no relationship between the dependent and independent variables. The resulting P value quantifies the certainty with which you can reject the hypothesis that there is no linear trend relating the two variables.*
Since the population of possible values of the regression slope is approximately normally distributed, we can use the general definition of the t statistic

t = (parameter estimate − true value of parameter)/(standard error of the parameter estimate)

to test this hypothesis. The equivalent mathematical statement is

t = (b − β)/sb
This equation permits testing the hypothesis that there is no trend in the population from which the sample was drawn, that is, β = 0, using either of the approaches to hypothesis testing developed earlier.
To take a classic hypothesis-testing approach (as in Chapter 4), set β to zero in the equation above and compute

t = b/sb
then compare the resulting value of t with the critical value tα defining the 100α percent most extreme values of t that would occur if the hypothesis of no trend in the population was true.
For example, the data in Figure 8-3B (and Table 8-1) yielded b = 0.44 g/cm and sb = 0.064 g/cm from a sample of 10 points. Hence, t = 0.44/0.064 = 6.875, which exceeds 5.041, the value of t for P < .001 with ν = 10 − 2 = 8 degrees of freedom (from Table 4-1). Hence, it is unlikely that this sample was drawn from a population in which there was no relationship between the independent and dependent variables, that is, height and weight. We can use these data to assert that as height increases, weight increases (P < .001).
Of course, like all statistical tests of hypotheses, this small P value does not guarantee that there is really a trend in the population, it just means it is unlikely that there is not such a trend. For example, the sample in Figure 1-2A is associated with P < .0005. Nevertheless, as Figure 1-2B shows, there is no trend in the underlying population.
If we wish to test the hypothesis that there is no trend in the population using confidence intervals, we use the definition of t above to find the 100(1 − α) percent confidence interval for the slope of the line of means,

b − tα sb < β < b + tα sb

We can compute the 95% confidence interval for β by substituting the value of t.05 with ν = n − 2 = 10 − 2 = 8 degrees of freedom, 2.306, into this equation together with the observed values of b and sb:

0.44 − 2.306(0.064) g/cm < β < 0.44 + 2.306(0.064) g/cm

0.29 g/cm < β < 0.59 g/cm
Since this interval does not contain zero, we can conclude that there is a trend in the population (P < .05).† Note that the interval contains the true value of the slope of the line of means, β = 0.5 g/cm.
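Both forms of the test are easy to script. The sketch below uses the values quoted above (b = 0.44 g/cm, sb = 0.064 g/cm) and the critical values from Table 4-1 for ν = 8 (2.306 for P < .05 and 5.041 for P < .001); the function name is illustrative.

```python
def slope_inference(b, s_b, t_crit_05, t_crit_001):
    """t test of the hypothesis beta = 0 and 95% confidence interval for the slope."""
    t = b / s_b                                        # test statistic for beta = 0
    ci = (b - t_crit_05 * s_b, b + t_crit_05 * s_b)    # 95% confidence interval for beta
    beyond_001 = abs(t) > t_crit_001                   # does t exceed the P < .001 cutoff?
    return t, ci, beyond_001

# Martian sample (Figure 8-3B); critical t values for nu = 10 - 2 = 8 from Table 4-1
t, ci, p_001 = slope_inference(b=0.44, s_b=0.064, t_crit_05=2.306, t_crit_001=5.041)
print(f"t = {t:.3f}")                                        # about 6.9
print(f"95% CI for beta: {ci[0]:.2f} to {ci[1]:.2f} g/cm")   # about 0.29 to 0.59, excludes 0
print(f"P < .001: {p_001}")
```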
It is likewise possible to test hypotheses about, or compute confidence intervals for, the intercept using the fact that

t = (a − α)/sa

is distributed according to the t distribution with ν = n − 2 degrees of freedom. For example, since sa = 2.6 g, the 95% confidence interval for the intercept based on the observations in Figure 8-3B is

−6.0 − 2.306(2.6) g < α < −6.0 + 2.306(2.6) g

−12.0 g < α < 0.0 g
which includes the true intercept of the line of means, α = −8 g.
A number of other useful confidence intervals associated with regression analysis, such as the confidence interval for the line of means, will be discussed next.
* This restriction is important. As discussed later in this chapter, it is possible for there to be a strong nonlinear relationship in the observations and for the procedures we discuss here to miss it.
† The 99.9% confidence interval does not contain zero either, so we could obtain the same P value (< .001) using confidence intervals as we did with t = b/sb earlier in this section.
There is uncertainty in the estimates of the slope and intercept of the regression line. The standard errors of the intercept and slope, sa and sb, quantify this uncertainty. These standard errors are sa = 2.6 g and sb = 0.064 g/cm for the regression of weight on height for the Martians in the sample in Figure 8-3. Thus, the line of means could lie slightly above or below the observed regression line or have a slightly different slope. It nevertheless is likely that the line of means lies within a band surrounding the observed regression line. Figure 8-7A shows this region. It is wider at the ends than in the middle because the regression line must be straight and must go through the point defined by the means of the independent and dependent variables.
Figure 8-7.
(A) The 95% confidence interval for the regression line relating Martian weight to height using the data in Figure 8-3. (B) The 95% confidence interval for an additional observation of Martian weight at a given height. This is the confidence interval that should be used when estimating weight from height, in order to be 95% confident that the range includes the true weight.
There is a distribution of possible values for the regression line at each value of the independent variable x. Since these possible values are normally distributed about the line of means, it makes sense to talk about the standard error of the regression line. (This is another consequence of the central-limit theorem.) Unlike the other standard errors we have discussed so far, this standard error is not constant but depends on the value of the independent variable x:

sŷ = sy·x √{1/n + (x − X̄)²/[(n − 1)sX²]}
Since the distribution of possible values of the regression line is normally distributed, we can compute the 100(1 − α) percent confidence interval for the regression line with

ŷ − tα sŷ < μy·x < ŷ + tα sŷ

in which tα has ν = n − 2 degrees of freedom and ŷ is the point on the regression line for each value of x,

ŷ = a + bx
Figure 8-7A shows the 95% confidence interval for the line of means. It is wider at the ends than the middle, as it should be. Note also that it is much narrower than the range of the data because it is the confidence interval for the line of means, not the population as a whole.
It is not uncommon for investigators to present the confidence interval for the regression line and discuss it as though it were the confidence interval for the population. This practice is analogous to reporting the standard error of the mean instead of the standard deviation to describe population variability. For example, Figure 8-7A shows that we can be 95% confident that the mean weight of all 40 cm tall Martians is between 11.0 and 12.5 g. We cannot be 95% confident that the weight of any one Martian that is 40 cm tall falls in this narrow range.
To compute a confidence interval for an individual observation, we must combine the total variability that arises from the variation in the underlying population about the line of means, estimated with sy·x, and the variability due to uncertainty in the location of the line of means, sŷ. Since the variance of a sum is the sum of the variances, the standard deviation of the predicted value of the observation will be

sYnew = √(sy·x² + sŷ²)
We can eliminate sŷ from this equation by replacing it with the equation for sŷ in the last section:

sYnew = sy·x √{1 + 1/n + (x − X̄)²/[(n − 1)sX²]}
This standard error can be used to define the 100(1 − α) percent confidence interval for an observation according to

ŷ − tα sYnew < Ynew < ŷ + tα sYnew
Remember that both sŷ and sYnew depend on the value of the independent variable x.
The two curved lines around the regression line in Figure 8-7B show the 95% confidence interval for an additional observation. This band includes both the uncertainty due to random variation in the population and variation due to uncertainty in the estimate of the true line of means. Notice that most members of the sample fall in this band. It quantifies the uncertainty in using Martian height to estimate weight, and hence, the uncertainty in the true weight of a Martian of a given height. For example, it shows that we can be 95% confident that the true weight of a 40 cm tall Martian is between 9.5 and 14.0 g. This confidence interval describes the precision with which it is possible to estimate the true weight. This information is much more useful than the fact that there is a statistically significant* relationship between the Martian weight and height (P < .001).
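The following sketch evaluates both intervals at a height of 40 cm from the sample summaries used above (a = −6.0 g, b = 0.44 g/cm, sy·x ≈ 1 g, X̄ = 36.9 cm, sX = 5.0 cm, n = 10); the variable names are illustrative. The results come out close to the intervals read from Figure 8-7, with small differences that reflect rounding of the inputs.

```python
from math import sqrt

# Summary values for the Martian sample (Figure 8-3B and Table 8-1)
n, mean_x, s_x = 10, 36.9, 5.0   # sample size, mean and SD of height
a, b, s_yx = -6.0, 0.44, 1.0     # regression coefficients and standard error of estimate
t_05 = 2.306                     # critical t for 95% confidence, nu = n - 2 = 8

x = 40.0                         # height at which we want the estimates, cm
y_hat = a + b * x                # point on the regression line

# Standard error of the regression line at x, and of a single new observation at x
s_line = s_yx * sqrt(1 / n + (x - mean_x) ** 2 / ((n - 1) * s_x ** 2))
s_new = s_yx * sqrt(1 + 1 / n + (x - mean_x) ** 2 / ((n - 1) * s_x ** 2))

print(f"95% CI for the line of means at {x:.0f} cm: "
      f"{y_hat - t_05 * s_line:.1f} to {y_hat + t_05 * s_line:.1f} g")
print(f"95% CI for one new Martian at {x:.0f} cm: "
      f"{y_hat - t_05 * s_new:.1f} to {y_hat + t_05 * s_new:.1f} g")
```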
* t = b/sb = .44/.064 = 6.875 for the data in Figure 8-3. t.001 for ν = 10 − 2 = 8 degrees of freedom is 5.041.
Motivated by the human and animal studies showing that cell phone use was associated with lower sperm motility, Geoffrey De Iuliis and colleagues† conducted an experiment in which they exposed normal human sperm, in petri dishes, to cell phone electromagnetic signals and measured both the production of intracellular reactive oxygen species (ROS), which are produced in cellular mitochondria and can damage DNA, and the amount of DNA damage.
They exposed the sperm (obtained from students with no known reproductive problems or infections) to cell phone signals of varying strengths for 16 hours and investigated the relationship between the strength of the signal and the level of ROS production and DNA damage. The sperm were exposed in petri dishes maintained at a constant 21°C temperature to avoid the problem that the higher radiofrequency radiation from stronger cell phone signals would heat the sperm more, which would affect sperm function. By holding temperature constant, De Iuliis and colleagues avoided the effects of this potential confounding variable.
They sought to investigate whether there was a dose-dependent effect of cell phone exposure on the amount of ROS produced by the sperm and the level of DNA damage to the sperm.
The independent variable in their study was cell phone signal strength, measured as the specific absorption rate (SAR) of the cell phone. (SAR is the rate of absorption of electromagnetic radiation from a cell phone by a model designed to simulate a human head.) The dependent variable was the fraction of sperm that tested positive on a MitoSOX Red test for ROS. A second question was whether there is a relationship between the level of ROS and a second dependent variable, the fraction of sperm that expressed 8-hydroxy-2′-deoxyguanosine (8-OH-dg), a marker for oxidative damage to sperm DNA.
Table 8-2 shows the data for this study.
Cell-Phone-Specific Absorption Rate, SAR (W/kg) | Sperm with Mitochondrial ROS (%) | Sperm with DNA Damage (%) |
---|---|---|
0.4 | 8 | 5 |
27.5 | 29 | 18 |
0.0 | 6 | 3 |
1.0 | 13 | 8 |
2.8 | 16 | 10 |
10.1 | 27 | 15 |
2.8 | 18 | 5 |
27.5 | 30 | 13 |
10.1 | 25 | 15 |
4.3 | 25 | 7 |
4.3 | 23 | 8 |
1.0 | 15 | 4 |
1.0 | 11 | 3 |
Figure 8-8 shows the relationship between the percentage of sperm that tested positive for ROS, R, as a function of the SAR, S, together with the results of doing a linear regression on these data. Even though the slope is significantly different from zero (P < .001), the regression line does not provide an accurate description of the data, which show a rapid increase in ROS generation at low SAR levels and then flatten out. This example illustrates the importance of always looking at a plot of the data together with the associated regression line to ensure that the central assumptions of linear regression—that the line of means is a straight line and that the residuals are randomly and normally distributed around the regression line—are met. In this case, neither assumption is satisfied, so we cannot use linear regression to make statements about the relationship apparent in these data.* Therefore, while this figure seems to indicate a strong relationship between the strength of the cell phone signal and the level of oxidative damage, we cannot use linear regression to make any statistical statement about the confidence with which we can draw that conclusion.
Figure 8-8.
The fraction of sperm with positive tests for mitochondrial reactive oxygen species increases with the intensity of the electromagnetic radiation produced by the cell phone, but this increase is not linear, so linear regression cannot be used to test hypotheses about the relationship between these two variables.
We are luckier about the relationship between the level of ROS and DNA damage (Fig. 8-9), where the assumptions of linear regression are satisfied. (Compare how well the regression line goes through the data with how poorly it does in Fig. 8-8.) Box 8-1 shows that the regression line has a slope of .505 with a standard error of .105 and an intercept of −.796% with a standard error of 2.27%. We test whether the slope and intercept are significantly different from zero by computing t = b/sb = .505/.105 = 4.810 and t = a/sa = −.796/2.27 = −.351, respectively. We compare these values of t with 2.201, the critical value of t that defines the most extreme 5 percent of the t distribution with ν = n − 2 = 13 − 2 = 11 degrees of freedom (from Table 4-2). Since the t for the slope exceeds this value, we reject the null hypothesis of no (linear) relationship between sperm ROS level and DNA damage and conclude that increased levels of ROS are associated with higher levels of DNA damage. (In fact, the value of t associated with the slope exceeds 4.437, the critical value for P < .001.) In contrast, the t for the intercept does not even approach the critical value, so we do not reject the null hypothesis that the intercept is zero. The overall conclusion from these two tests is that the fraction of sperm with DNA damage increases in proportion to the fraction of sperm with elevated levels of ROS. Therefore, based on this experiment, we can conclude with a high level of confidence that higher levels of ROS production in sperm mitochondria cause DNA damage (P < .001).
Figure 8-9.
Sperm with higher levels of mitochondrial reactive oxygen species have higher levels of DNA damage. In contrast to the results in Figure 8-8, the data are consistent with the assumptions of linear regression, so we can use linear regression to draw conclusions about this relationship.
The first two columns of the table below present the data from Table 8-2 (last two columns), together with the square of the independent variable values, the product of the independent and dependent variables, and the sums necessary to compute the linear regression.
Calculation of Linear Regression of DNA Damage
Sperm with Mitochondrial ROS (%) X | Sperm with DNA Damage (%) Y | X² | XY | Fit Regression Line Ŷ | Residual (Y − Ŷ) | Residual² (Y − Ŷ)² |
---|---|---|---|---|---|---|
8 | 5 | 64 | 40 | 3.25 | 1.75 | 3.07 |
29 | 18 | 841 | 522 | 13.86 | 4.14 | 17.12 |
6 | 3 | 36 | 18 | 2.24 | 0.76 | 0.58 |
13 | 8 | 169 | 104 | 5.78 | 2.22 | 4.95 |
16 | 10 | 256 | 160 | 7.29 | 2.71 | 7.33 |
27 | 15 | 729 | 405 | 12.85 | 2.15 | 4.61 |
18 | 5 | 324 | 90 | 8.30 | −3.30 | 10.91 |
30 | 13 | 900 | 390 | 14.37 | −1.37 | 1.87 |
25 | 15 | 625 | 375 | 11.84 | 3.16 | 9.98 |
25 | 7 | 625 | 175 | 11.84 | −4.84 | 23.43 |
23 | 8 | 529 | 184 | 10.83 | −2.83 | 8.01 |
15 | 4 | 225 | 60 | 6.79 | −2.79 | 7.76 |
11 | 3 | 121 | 33 | 4.76 | −1.76 | 3.11 |
246 | 114 | 5444 | 2556 | | | 102.75 |
n = 13; X̄ = 246/13 = 18.9%; sX = 8.11%.
The intercept of the regression line is

a = (ΣX²ΣY − ΣXΣXY)/[nΣX² − (ΣX)²] = [5444(114) − 246(2556)]/[13(5444) − 246²] = −.796%

and the slope is

b = (nΣXY − ΣXΣY)/[nΣX² − (ΣX)²] = [13(2556) − 246(114)]/[13(5444) − 246²] = .505

Thus, the regression equation is

Ŷ = −.796 + .505X
We use the regression equation to compute the predicted value of y for each observed X; for example, for the first observation, X = 8%,

Ŷ = −.796 + .505(8) = 3.25%

and the associated residual is

Y − Ŷ = 5 − 3.25 = 1.75%

Thus, the standard error of the estimate is

sy·x = √[Σ(Y − Ŷ)²/(n − 2)] = √[102.75/(13 − 2)] = 3.06%
The standard deviation of the observed values of the independent variable, sX, is 8.11%, so the standard errors of the intercept and slope are sa = 2.27% and sb = .105, respectively.
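For readers who want to check Box 8-1 by machine, the sketch below repeats the computation on the ROS and DNA-damage columns of Table 8-2. The variable names are illustrative, and the printed values differ slightly from those quoted in the box because of rounding of intermediate results.

```python
from math import sqrt

# Table 8-2: percent of sperm with mitochondrial ROS (X) and with DNA damage (Y)
X = [8, 29, 6, 13, 16, 27, 18, 30, 25, 25, 23, 15, 11]
Y = [5, 18, 3, 8, 10, 15, 5, 13, 15, 7, 8, 4, 3]
n = len(X)

mean_x, mean_y = sum(X) / n, sum(Y) / n
s_xx = sum((x - mean_x) ** 2 for x in X)      # sum of squares of X about its mean

b = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y)) / s_xx   # slope, about .505
a = mean_y - b * mean_x                                             # intercept, about -.80

ss_residual = sum((y - (a + b * x)) ** 2 for x, y in zip(X, Y))     # about 102.7
s_yx = sqrt(ss_residual / (n - 2))            # standard error of the estimate, about 3.06
s_b = s_yx / sqrt(s_xx)                       # standard error of the slope

t = b / s_b
print(f"b = {b:.3f}, a = {a:.3f}, s_yx = {s_yx:.2f}, s_b = {s_b:.3f}, t = {t:.2f}")
# t exceeds 4.437, the critical value for P < .001 with nu = 13 - 2 = 11 degrees of freedom
```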
† De Iuliis GN, Newey RJ, King BV, Aitken RJ. Mobile phone radiation induces reactive oxygen species production and DNA damage in human spermatozoa in vitro. PLoS One. 2009;4(7):e6446. doi:10.1371/journal.pone.0006446.
* Examining the relationship between ROS formation and SAR suggests a saturating exponential,

R = R∞(1 − e^(−sS))

where the two parameters are R∞, the maximum fraction of sperm testing positive for ROS, and s, the exponential rate at which ROS increases. Such a relationship would occur if the rate of sperm becoming ROS positive depends on the fraction of sperm that are not yet ROS positive. It is possible to fit a nonlinear equation to these data. See Glantz S, Slinker B. Nonlinear regression. Primer of Applied Regression and Analysis of Variance, 2nd ed. New York: McGraw-Hill; 2001:chap 11.
The situation often arises in which one wants to compare two regression lines. There are actually three possible comparisons one might want to make:
- Test for a difference in slope (without regard for the intercepts).
- Test for a difference in intercept (without regard for the slopes).
- Make an overall test of coincidence, in which we ask if the lines are different.
The procedures for comparing two slopes or intercepts are a direct extension of the fact that the observed slopes and intercepts follow the t distribution. For example, to test the hypothesis that two samples were drawn from populations with the same slope of the line of means, we compute

t = (b1 − b2)/sb1−b2
where the subscripts 1 and 2 refer to data from the first and second regression data samples. This value of t is compared to the critical value of the t distribution with ν = n1 + n2 − 4 degrees of freedom. This test is exactly analogous to the definition of the t test to compare two sample means.
If the two regressions are based on the same number of data points, the standard error of the difference of two regression slopes is

sb1−b2 = √(sb1² + sb2²)
If the two regressions are based on different numbers of points, use a pooled estimate of the variation about the two regression lines. Analogous to the pooled estimate of the variance in the t test in Chapter 4, compute

s²y·xp = [(n1 − 2)s²y·x1 + (n2 − 2)s²y·x2]/(n1 + n2 − 4)

and then compute the standard error of the difference of the slopes from this pooled variance,

sb1−b2 = √{s²y·xp/[(n1 − 1)s²X1] + s²y·xp/[(n2 − 1)s²X2]}
* This section deals with more advanced material and can be skipped without loss of continuity. It is also possible to test for differences between three or more regression lines using techniques which are generalizations of regression and analysis of variance; see Zar JH. Comparing simple linear regression equations. Biostatistical Analysis, 4th ed. Upper Saddle River, NJ: Prentice-Hall; 1999:chapter 18. For a discussion of how to use multiple regression models to compare several regression lines, including how to test for parallel shifts between regression lines, see Glantz S, Slinker B. Regression with two or more independent variables. Primer of Applied Regression and Analysis of Variance, 2nd ed. New York: McGraw-Hill; 2001:chap 3.
It is also possible to test the null hypothesis that two regressions are coincident, that is, have the same slope and intercept. Recall that we computed the slope and intercept of the regression line by selecting the values that minimized the total sum of squared differences between the observed values of the dependent variable and the value on the line at the same value of the independent variable (residuals). The square of the standard error of the estimate, sy·x, is the estimate of this residual variance around the regression line and it is a measure of how closely the regression line fits the data. We will use this fact to construct our test by examining whether fitting the two sets of data with separate regression lines (in which the slopes and intercepts can be different) produces smaller residuals than fitting all the data with a single regression line (with a single slope and intercept).
The specific procedure for testing for coincidence of two regression lines is
- Fit each set of data with a separate regression line.
- Compute the pooled estimate of the variance around the two regression lines, s²y·xp, using the previous equations. This statistic is a measure of the overall variability about the two regression lines, allowing the slopes and intercepts of the two lines to be different.
- Fit all the data with one regression line, and compute the variance around this one “single” regression line, s²y·xs. This statistic is a measure of the overall variability observed when the data are fit by assuming that they all fall along one line of means.
- Compute the “improvement” in the fit obtained by fitting the two data sets with separate regression lines compared to fitting them with a single regression line using

s²imp = [(n1 + n2 − 2)s²y·xs − (n1 + n2 − 4)s²y·xp]/2

- The numerator in this expression is the reduction in the total sum of squared differences between the observations and regression line that occurs when the two lines are allowed to have different slopes and intercepts. It can also be computed as

s²imp = (SSres,s − SSres,p)/2

- where SSres,s and SSres,p are the sums of squared residuals about the single regression line and about the two separate regression lines, respectively.
- Compute the ratio of the improvement in the fit obtained when fitting the two sets of data separately (rather than all the data with a single line) to the residual variation about the regression lines when fitting the two lines separately, using the F-test statistic

F = s²imp/s²y·xp
- Compare the observed value of the F-test statistic with the critical values of F for νn = 2 numerator degrees of freedom and νd = n1 + n2 − 4 denominator degrees of freedom.
If the observed value of F exceeds the critical value of F, it means that we obtain a significantly better fit to the data (measured by the residual variation about the regression line) by fitting the two sets of data with separate regression lines than we do by fitting all the data with a single line. We reject the null hypothesis of a single line of means and conclude that the two sets of data were drawn from populations with different lines of means.
Rheumatoid arthritis is a disease in which a person’s joints become inflamed so that movement becomes painful, and people find it harder to complete mechanical tasks, such as holding things. At the same time, as people age, they often lose muscle mass. As a result, P. S. Helliwell and S. Jackson* wondered whether the reduction in grip strength noted in people who had arthritis was due to the arthritic joints or simply a reflection of a reduction in mass of muscle.
To investigate this question, they measured the cross-sectional area (in cm²) of the forearms of a group of normal people and a group of similar people with arthritis as well as the force (in newtons) with which they could grip a test device. Figure 8-10 shows the data from such an experiment, using different symbols to indicate the two groups of people. The question is: Is the relationship between muscle cross-sectional area and grip strength different for the normal people (circles) and the people with arthritis (triangles)?
We will answer this question by first doing a test for overall coincidence of the two regressions. Figure 8-11A shows the same data as in Figure 8-10, with separate regression equations fit to the two sets of data, and Table 8-3 presents the results of fitting these two regression equations. Using the formula presented earlier, the pooled estimate of the variance about the two regression lines fit separately is

s²y·xp = [(25 − 2)(45.7)² + (25 − 2)(40.5)²]/(25 + 25 − 4) = 1864 N²
Figure 8-11.
In order to test whether the two groups of people (normal subjects and people with arthritis) have a similar relationship between muscle cross-sectional area and grip strength, we first fit the data for the two groups separately (A), then together (B). If the null hypothesis that there is no difference between the two groups is true, then the variation about the two regression lines fit separately will be approximately the same as the variation about the single regression line when the two sets of data are fit together.
 | Normal | Arthritis | All People |
---|---|---|---|
Sample size n | 25 | 25 | 50 |
Intercept a (sa), N | −7.3 (25.3) | 3.3 (22.4) | −23.1 (50.5) |
Slope b (sb), N/cm2 | 10.19 (.789) | 2.41 (.702) | 6.39 (1.579) |
Standard error of the estimate sy·x, N | 45.7 | 40.5 | 129.1 |
Next, fit all the data to a single regression equation, without regard for the group to which each person belongs; Figure 8-11B shows this result, with the results of fitting the single regression equation in the last column of Table 8-3. The total variance of the observations about the single regression line is s²y·xs = (129.1)² = 16,667 N². This value is larger than that observed when the two lines were fit separately. To estimate the improvement (reduction) in variance associated with fitting the two lines separately, we compute

s²imp = [(25 + 25 − 2)(16,667) − (25 + 25 − 4)(1864)]/2 ≈ 357,000 N²

Finally, we compare this improvement with the residual variation about the two regression lines fit separately (which yields the smallest residual variance) using the F-test statistic

F = s²imp/s²y·xp = 357,000/1864 = 192
This value exceeds 5.10, the critical value of F for P < .01 with νn = 2 and νd = nnormal + narthritis − 4 = 25 + 25 − 4 = 46 degrees of freedom, so we conclude that the relationship between grip force and cross-sectional area is different for normal people and people with arthritis.
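The arithmetic of this coincidence test can be organized as a small function that works from the summary values in Table 8-3. This is only a sketch of one way to do it (the function and variable names are illustrative), and it reproduces the F statistic computed above.

```python
def coincidence_F(n1, s_yx1, n2, s_yx2, s_yx_single):
    """F test for coincidence of two regression lines, computed from the residual
    standard errors of the two separate fits and of the single combined fit."""
    ss_separate = (n1 - 2) * s_yx1 ** 2 + (n2 - 2) * s_yx2 ** 2   # residual SS, separate lines
    ss_single = (n1 + n2 - 2) * s_yx_single ** 2                  # residual SS, one line
    s2_pooled = ss_separate / (n1 + n2 - 4)         # pooled variance about the separate lines
    s2_improvement = (ss_single - ss_separate) / 2  # improvement from allowing two lines
    return s2_improvement / s2_pooled               # F with 2 and n1 + n2 - 4 degrees of freedom

# Summary values from Table 8-3 (normal subjects and people with arthritis)
F = coincidence_F(n1=25, s_yx1=45.7, n2=25, s_yx2=40.5, s_yx_single=129.1)
print(f"F = {F:.1f}")   # well above 5.10, the critical value for P < .01 with 2 and 46 df
```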
The next question that arises is where the difference comes from. Are the intercepts or slopes different? To answer this question, we compare the intercepts and slopes of the two regression equations. We begin with the intercepts. Since the two regressions are based on the same number of data points, we can use the results in Table 8-3 to compute the standard error of the difference in the two regression intercepts with

sa1−a2 = √(sa1² + sa2²) = √(25.3² + 22.4²) = 33.8 N

and

t = (a1 − a2)/sa1−a2 = (−7.3 − 3.3)/33.8 = −.31
which does not come near exceeding 2.013 in magnitude, the critical value of t for P < .05 for ν = nnormal + narthritis − 4 = 46 degrees of freedom. Therefore, we do not conclude that the intercepts of the two lines are significantly different.
A similar analysis comparing the slopes yields t = 7.367, so we do conclude that the slopes are different (P < .001). Hence, the increase in grip force per unit increase in cross-sectional muscle area is smaller for people with arthritis than for normal people.
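The two comparisons can be checked directly from Table 8-3 with a few lines of code; the helper below (names illustrative) reproduces t of about −.31 for the intercepts and about 7.37 for the slopes, consistent with the conclusions above.

```python
from math import sqrt

def compare_coefficients(c1, se1, c2, se2):
    """t statistic for the difference between two regression coefficients (slopes or
    intercepts) when the two regressions are based on the same number of points."""
    se_difference = sqrt(se1 ** 2 + se2 ** 2)   # standard error of the difference
    return (c1 - c2) / se_difference            # compare with t for nu = n1 + n2 - 4

# Values from Table 8-3: normal subjects versus people with arthritis
t_intercepts = compare_coefficients(-7.3, 25.3, 3.3, 22.4)    # about -0.31
t_slopes = compare_coefficients(10.19, 0.789, 2.41, 0.702)    # about 7.37
print(f"intercepts: t = {t_intercepts:.3f}")
print(f"slopes:     t = {t_slopes:.3f}")
# The critical value of t for P < .05 with nu = 46 degrees of freedom is 2.013
```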
* Helliwell PS, Jackson S. Relationship between weakness and muscle wasting in rheumatoid arthritis. Ann Rheum Dis. 1994;53:726−728.
Linear regression analysis of a sample provides an estimate of how, on the average, a dependent variable changes when an independent variable changes and an estimate of the variability in the dependent variable about the line of means. These estimates, together with their standard errors, permit computing confidence intervals to show the certainty with which you can predict the value of the dependent variable for a given value of the independent variable. In some experiments, however, two variables are measured that change together but neither can be considered to be the dependent variable. In such experiments, we abandon any pretense of making a statement about causality and simply seek to describe the strength of the association between the two variables. The correlation coefficient, a number between −1 and +1, is often used to quantify the strength of this association. Figure 8-12 shows that the tighter the relationship between the two variables, the closer the magnitude of r is to 1; the weaker the relationship between the two variables, the closer r is to 0. We will examine two different correlation coefficients.