[ACSESS Publications] Applied Statistics in Agricultural, Biological, and Environmental Sciences  Chapter 2: Analysis of Variance and hypothesis testing
McIntosh, Marla S.
Year: 2018
Language: English
DOI: 10.2134/appliedstatistics.2016.0009

Published online August 23, 2018

Chapter 2: Analysis of Variance and Hypothesis Testing

Marla S. McIntosh*

Abstract

This introductory chapter offers a summary review and practical guidance for using analysis of variance (ANOVA) to test hypotheses of fixed effects. The target audience is agricultural, biological, or environmental researchers, educators, and students already familiar with ANOVA. Descriptions of ANOVA, experimental design, linear models, random and fixed effects, and ANOVA components are presented along with discussions of their roles in scientific research. A case study with balanced, normally distributed data is used to illustrate ANOVA concepts and offer hands-on ANOVA experience. The case study involves a factorial experiment conducted at three locations and analyzed using a mixed model approach. Analysis of variance results are explained and used to construct ANOVA tables that effectively communicate key details to verify that the statistical design, analysis, and conclusions are appropriate and valid. The goal of this chapter is to: i) empower readers to make informed decisions to choose appropriate experimental designs, statistical estimates, and tests of significance to address the research objectives; and ii) provide advice on presenting statistical details in research papers to ensure that the experimental design, analysis, and interpretation were valid and reliable.

Research in agricultural, biological, and environmental science disciplines is a major contributor to our basic scientific understanding that can lead to new and improved technologies, a safer and healthier food supply, and best practices for sustaining our ecosystems. It can be argued that we owe much of the overall success of research in these disciplines to the widespread use of analysis of variance (ANOVA), which began the modern era of applied statistics.
Analysis of variance was initially conceived for agricultural researchers conducting field experiments to determine whether differences between treatments were significant (i.e., reliable and repeatable) rather than an artifact of variable environmental conditions (Fisher, 1926). Subsequently, the classical ANOVA concepts have been built on, refined, and adapted across disciplines, leading to new ANOVA methodologies that have substantially extended the potential scope and application of ANOVA. In this chapter, we begin with classical ANOVA concepts and terminology to serve as the foundation and give context for understanding the contemporary ANOVA developed for mixed models. A case study provides a real-world example of ANOVA based on a randomized complete block design (RCBD) factorial experiment conducted at three locations. The case study analyzes a balanced data set that meets the classical ANOVA assumptions. The ANOVA results are described and readers are encouraged to replicate the ANOVA and/or devise their own analyses. A broad overview and practical guidance on effective use of ANOVA is intended for scientists, educators, and students who have a basic understanding of statistics and experimental design.

Abbreviations: ANOVA, analysis of variance.
M.S. McIntosh, Professor Emerita, Department of Plant Science and Landscape Architecture, University of Maryland, College Park, Maryland, 20742. *Corresponding author (mmcintos@umd.edu).
doi:10.2134/appliedstatistics.2016.0009
Applied Statistics in Agricultural, Biological, and Environmental Sciences. Barry Glaz and Kathleen M. Yeater, editors. © American Society of Agronomy, Crop Science Society of America, and Soil Science Society of America, 5585 Guilford Road, Madison, WI 53711-5801, USA.

The ANOVA Process

The experimental process begins with a researcher interested in testing a hypothesis that, according to the scientific literature and the researcher’s opinion, is reasonable and justifiable.
Based on the experimental objectives and scope of the inferences desired, the research team determines appropriate treatments to be applied and dependent variables to be measured. Subsequently, the experimental design, including the number and size of experimental and sampling units, is planned (often with the help of a statistician). The experimental design should provide the desired level of precision and power balanced by the perceived limits on resources, time, effort, and expense [see Chapters 1, 3, and 4 (Garland-Campbell, 2018; Casler, 2018a,b)]. A linear model is then proposed to explain the variation observed in the dependent variable and a suitable statistical method is chosen to conduct the analysis. Once the plans for the experiment are complete and thoroughly reviewed, the experiment is conducted and the data are carefully recorded along with field or laboratory notes. After the data are collected and checked for errors and outliers, the statistical analysis is conducted. The statistical results are often found within seemingly endless computer output filled with numbers that appear to be so precise that they are presented with ten digits. The output includes all kinds of statistics, parameter estimates, and probabilities, and it is up to the research team to use the appropriate statistics to make sense of the data and arrive at valid conclusions.

The experimental process as described illustrates that statistics are not something that happens after an experiment is completed, but a component in every step of the process. Also, the decisions related to the design and analysis of the experiment can be complex and interdependent, requiring a scientific and practical understanding of the populations being investigated as well as a working knowledge of ANOVA concepts and procedures.
As too many researchers have learned by experience, data analysis mistakes can be corrected, but poor experimental design choices can be irreversible and require repeating all or part of an experiment. The point is that statistics matter and can determine whether your research accomplishes its objectives and whether your conclusions will stand the test of time or even see the light of day.

Obviously, the impact of the research is greatly magnified if published in a peer-reviewed, high-quality scientific journal. These journals have their own standards, which invariably include scientific merit. Well-designed and statistically sound experiments are integral to scientific quality and integrity. Thus, scientific journals often require that papers describe the statistical design and analyses in order to verify they are correct and appropriate for the stated objectives and conclusions.

Meta-analysis has shown that statistical significance often determines if a study is publishable and sets a path for future research (Mervis, 2014). This can have negative consequences. Depending on the effect tested, knowing that an explanatory effect is not significant can be an important discovery (e.g., chemical Y does not significantly affect beneficial insect counts) or can discourage further research on ineffective treatments (e.g., chemical Y does not significantly affect targeted pest insect counts). On the other hand, an effect deemed not significant as a result of a Type 2 error should not be published, but the Type 2 error rate is rarely known and can be surprisingly large. Thus, if a paper with nonsignificant results is published, it should include a justification for the adequacy of the experimental design and power of the tests.
History of Analysis of Variance

Analysis of variance was introduced at a time when agronomists were desperately seeking a statistically sound technique to improve the credibility of field researchers who were (and still are) plagued by inherent variations caused by uncontrolled factors that can mask or exaggerate treatment effects. Agricultural researchers needed a scientific method to design and analyze data that would provide a measure of confidence that the measured differences in yield between treated plots and their controls were reliable and repeatable. Agronomic researchers were all too aware of the vagaries of field variation that often overshadowed the actual effects of treatments, and R.A. Fisher developed ANOVA as a statistical method to address this problem.

Fisher was hired by the Rothamsted Research Station to statistically analyze data and observations collected from continuous research on wheat that was conducted for over 70 yr to determine the causes of variation in yields (Fisher, 1921; Fisher and Mackenzie, 1923). Making use of the extensive long-term field and weather data and notes, Fisher developed ANOVA to estimate and compare the relative contributions of various factors (e.g., weather, soil fertility, weed cover, and manure treatments) to the observed variations in yield. Over the next few years, Fisher further developed the principles and practices for ANOVA and wrote the groundbreaking classical statistics treatise, “Statistical Methods for Research Workers” (Fisher, 1925). This book, written for agronomic researchers, educators, and students rather than statisticians, set the statistical standards for the design and analysis of field experiments. Subsequently, ANOVA has been adapted for statistical testing of experimental data throughout most scientific disciplines. As its use has grown, so have its theory and applications.
Fisher used probability distributions to determine if treatment means were “significantly” different or different by random chance. The ANOVA technique estimated the probability that the means were not different, assuming the population(s) of means were normally and independently distributed with equal error variation. Early agronomic researchers routinely replicated treatment plots to measure plot-to-plot variability and calculate a standard deviation (SD) for each treatment mean. However, researchers typically designed experiments where control plots were replicated to be compared with adjacent treated plots to minimize error variation. Fisher successfully demonstrated that this practice led to inherently biased estimates of error variation and argued that ANOVA requires that treatments and control plots must both be replicated and randomized. This new understanding of the importance of randomization and replication on estimating error variance was arguably the most revolutionary aspect of ANOVA that advanced applied statistics by exploiting the power of experimental design.

Analysis of variance brought a new age of field experimentation that fostered the creation of experimental designs to reduce random error and gain power by increasing the number of replications and/or samples. In the basic ANOVA process, sources of variation are identified, estimated, and compared to test null hypotheses based on statistical probability.
The ANOVA addressed the most pressing concern of agricultural researchers by providing a scientifically sound and systematic process that could be taught to researchers so they could: i) design experiments to incorporate the ANOVA principles of treatment replication and randomization; ii) identify the sources (factors) that contribute to the total variation of the data; iii) analyze data using a simple step-by-step procedure to calculate the variation contributed by different sources; and iv) use the ratio of the variation among treatments to the variation within treatments to test the significance of a treatment effect.

Unfortunately, the mathematical simplicity that makes classical ANOVA practical and appealing to researchers comes with limitations. The theoretical probabilities used to test the significance of the differences among means are based on assumptions that the means being compared are from normally distributed populations and their error variances are independent and equal. However, Box (1976) pointed out that “in nature there never was a normal distribution, there never was a straight line, yet with normal and linear assumptions, known to be false, he can often derive results which match, to a useful approximation, those found in the real world”. Thus, in practice, the assumptions of normality and homogeneity of variance are intended as approximate. Fortunately, based on the Central Limit Theorem, if a dependent variable is not normally distributed, the distribution of its means will be approximately normal given a sufficiently large sample size. Also, there is substantial evidence that minor deviations from the basic assumptions have little impact on ANOVA results (Acutis et al., 2012). However, deviations from these assumptions that are not trivial can lead to invalid probabilities for tests of significance and improper inferences.
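The Central Limit Theorem argument above can be illustrated with a small simulation. The sketch below is not part of the chapter's case study: the exponential population, the sample sizes, the seed, and the helper names `skewness` and `simulate_means` are all assumptions chosen for illustration. It draws repeated sample means from a strongly skewed population and reports their skewness, which should shrink toward zero (the value for a symmetric distribution) as the number of observations per mean grows.

```python
import random
import statistics


def skewness(values):
    """Sample skewness: the mean cubed z-score (0 for a symmetric distribution)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


def simulate_means(sample_size, n_means, rng):
    """Return n_means sample means, each averaging sample_size exponential draws."""
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(sample_size)])
        for _ in range(n_means)
    ]


rng = random.Random(2018)  # arbitrary seed so the run is repeatable

# Skewness of the simulated means for 1, 5, and 30 observations per mean.
skew_by_size = {}
for sample_size in (1, 5, 30):
    means = simulate_means(sample_size, 5000, rng)
    skew_by_size[sample_size] = skewness(means)
    print(f"n = {sample_size:2d}  skewness of means = {skew_by_size[sample_size]:.2f}")
```

How large a sample is "sufficiently large" depends on how far the population departs from normality, so a simulation of this kind is only a rough guide for any particular variable.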
In response to the shortcomings of classical ANOVA, newer approaches to ANOVA for fitting models, estimating variance and/or covariance parameters, and testing significance of effects have been developed that are not conditional on the classical ANOVA assumptions. Contemporary ANOVA methods that employ computer-intensive maximum likelihood estimation rather than the classical method of moments have been designed for analysis of mixed models containing both fixed and random effects. More recently, generalized linear mixed model analysis has been developed for mixed model analysis of response variables that are not necessarily normally distributed. The newer statistical methods have expanded capabilities and applications that allow ANOVA to address the modern-day challenges posed by “big data” and understanding complex systems (Gbur et al., 2012).

Statistical software is widely available for scientists and educators to utilize generalized linear mixed model analysis for ANOVA of response variables of various distributions (e.g., normal, binomial, Poisson, lognormal), explanatory effects that are continuous or categorical, and models that are fixed, random, or mixed. However, the flexibility and power of generalized model fitting also require a level of statistical skill and expertise far above the classical methods to effectively and correctly perform and interpret the ANOVA. In this chapter, the statistical analyses are limited to continuous and normally distributed response variables that are fitted to a linear mixed model as implemented using PROC MIXED (SAS 9.4). Non-Gaussian (non-normal) and categorical data are best analyzed using a generalized linear model approach, which is the focus of Chapter 16 by Stroup (2018).
Planning an Experiment

A successful experiment, one that provides statistically sound evidence to reach correct conclusions about hypotheses, requires careful and skillful planning. The strength and correctness of an experiment are greatly influenced by the many decisions made during the planning process. Thus, before walking or running (but never jumping) to conclusions, well-defined experimental goals and objectives should be identified to enable the researcher to make informed choices to optimize the experimental design and analysis. These choices require considerations of experimental space and time, number and arrangement of experimental and sampling units, precision and suitability of laboratory and/or field techniques, equipment and personnel availability and costs, and the available and future funding levels. Keeping the research goals in mind, the researcher strives for a desired level of power for ANOVA tests in view of the practical considerations of resources, time, effort, and expense. The final plan should ensure that the results of the data analysis lead to statistically valid and convincing evidence to support the research conclusions. Table 1 contains a checklist of points useful for evaluating the quality and validity of the experimental design and analysis.

Even when collaborating with a statistician, researchers should take an active role in determining the best experimental design and analysis. Choices determining the quality of both the data and the statistical tests are made at each stage of designing and analyzing an experiment. These choices require both statistical expertise and a scientific and practical understanding of the populations being investigated. Regardless of who designs and analyzes the experiments, the primary researchers and/or authors are responsible for the integrity of the research.
In other words, the statistical aspects of research should not be viewed as a separate endeavor to be handed over to a statistician but rather as a collaborative and highly interactive effort. To ensure a successful outcome, joint consultation beginning at the planning stages is essential.

Table 1. Checklist for evaluating the quality and validity of ANOVA.

Experimental and treatment design
1. Descriptions of experimental and treatment designs include sufficient detail to be judged or repeated.
2. Experimental units of the treatments are randomized and replicated.
3. Replications of experimental units and sampling units are adequate.
4. Blocks, sites, years, and sampling units considered to be random effects have adequate replication.
5. Designs such as split-plot and sampling designs that result in multiple error terms are mentioned.
6. Factorial treatment designs are used effectively to test relationships of treatments and add power to tests.

Statistical analyses
1. Statistical analyses are described in sufficient detail to be evaluated or repeated.
2. Quality and quantity of data are evident.
3. Statistical methods are appropriate for data and objectives.
4. Theoretical assumptions are validated or justified.
5. P-values for tests of significance are used correctly.
6. Power of tests is sufficient to meet research objectives.

Hypothesis testing
1. Tests of significance are meaningful and preferably not confounded.
2. Effects are classified as fixed or random to determine whether the inference space (narrow, intermediate, or broad) and parameters of interest are appropriate for research objectives/questions.
3. Inferences for fixed effects are limited to the range of levels of the effect in the experiment. The parameters of interest are means.
4. Inferences for random effects are broad and refer to the population of all possible levels of the effect. The parameters of interest are variances.
5. Random effects used as error terms are estimated with at least five degrees of freedom; otherwise, tests of significance will lack power, resulting in a high probability of a Type 2 error.
6. For factorial treatment designs, tests of significance (fixed effects) or variance estimates (random effects) are conducted for main effects and interactions.
7. Treatment means that are not structured are compared or rated using an appropriate multiple comparison procedure, usually a least significant difference (LSD), which is equivalent to multiple t-tests.

The Linear Model

The linear model is the mathematical framework for ANOVA and provides a succinct description of the effects (known and postulated) that contribute to the total variation of the observations. A linear equation is used to specify all of the terms in the model. The simplest linear model outlines an experiment with treatments replicated in a completely randomized design (CRD), where the value of an observation is equal to the overall mean of the observations plus the effect of its treatment mean and its random error (Eq. [1]).

Y = overall mean + treatment effect + random error [1]

The linear model of an experiment becomes more complicated as experimental and treatment design factors are added, and the more complicated the model, the more helpful the linear model is for describing an experiment. The format of a linear model is flexible and can be adapted for specific uses and users such as: i) statistical textbooks on ANOVA and its mathematical foundation useful for teachers and students (Eq. [2]); ii) papers on ANOVA theory and application in journals intended for statisticians and mathematicians (Eq. [3]); iii) papers on research analyzed using ANOVA techniques in journals intended for scientists (Eq. [4]). Examples of different formats of linear models are found throughout this book. Here, a linear model for an RCBD factorial experiment with two treatment factors is shown in three different formats (Eqs. [2–4]).
This linear model includes effects associated with the experimental design (block and random error) and the factorial treatment design (A and B main effects and A × B interaction).

Y = overall mean + block effect + A main effect + B main effect + A × B interaction + random error [2]

$Y_{ijk} = \mu + r_i + A_j + B_k + AB_{jk} + e_{ijk}$ [3]

where $Y_{ijk}$ = the observation at the ith block, jth level of A, and kth level of B; $\mu$ = the overall mean; $r_i$ = the effect of the ith block; $A_j$ = the effect of the jth level of A; $B_k$ = the effect of the kth level of B; $AB_{jk}$ = the interaction effect of the jth level of A and kth level of B; and $e_{ijk}$ = the residual error of the ith block, jth level of A, and kth level of B.

Y = Block + A + B + A×B + Error [4]

Equation [2] uses general descriptive statistical terms to explain that the dependent observation is equal to the overall mean plus the effects of the block, treatment factors, and experimental error. Although lacking detail, Eq. [2] defines the effects tested for significance (A, B, and A × B) as well as the “nuisance” effects (Block and Error) associated with the experimental design.

Equation [3] uses a format common to mathematicians and statisticians in which a Greek or Latin letter represents each effect and subscript notation characterizes the effect level. This format provides a level of detail, clarity, and flexibility that is appropriate for graduate-level statistical textbooks, papers in statistical journals, and experiments with complex or unorthodox linear models. However, this format may contain a level of detail that can overwhelm and confuse introductory-level students and nonstatisticians. The major drawback, even for statisticians, is that the symbols are not standard. Thus, the symbol and subscript for each effect in the equation must be defined within each publication. For complicated models with many terms, this can be tedious and time-consuming to follow, especially when comparing models between publications.
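To make the subscript notation of Eq. [3] concrete, the following sketch generates one observation per block × A × B combination by adding up the terms of the model. It is only an illustration under stated assumptions: the function name `simulate_rcbd_factorial`, the effect values, and the standard deviations are hypothetical, block effects are drawn as random normal deviates, and the fixed A, B, and A × B effects are written as deviations that sum to zero.

```python
import random


def simulate_rcbd_factorial(n_blocks, a_effects, b_effects, ab_effects,
                            mu, block_sd, error_sd, seed=0):
    """Generate Y_ijk = mu + r_i + A_j + B_k + AB_jk + e_ijk for every i, j, k.

    Returns a list of (block, level_of_A, level_of_B, y) tuples, 1-indexed.
    """
    rng = random.Random(seed)
    # r_i: one random effect per block (assumed normal with mean 0).
    block_effects = [rng.gauss(0.0, block_sd) for _ in range(n_blocks)]
    rows = []
    for i in range(n_blocks):
        for j, a_j in enumerate(a_effects):
            for k, b_k in enumerate(b_effects):
                e_ijk = rng.gauss(0.0, error_sd)  # residual error for this plot
                y = mu + block_effects[i] + a_j + b_k + ab_effects[j][k] + e_ijk
                rows.append((i + 1, j + 1, k + 1, y))
    return rows


# Hypothetical experiment: 4 blocks, 2 levels of A, 3 levels of B.
observations = simulate_rcbd_factorial(
    n_blocks=4,
    a_effects=[-1.0, 1.0],
    b_effects=[-2.0, 0.0, 2.0],
    ab_effects=[[0.5, 0.0, -0.5], [-0.5, 0.0, 0.5]],
    mu=50.0,
    block_sd=1.5,
    error_sd=1.0,
)
for block, a_level, b_level, y in observations[:6]:
    print(f"block {block}  A{a_level}  B{b_level}  Y = {y:.2f}")
```

Setting `block_sd` and `error_sd` to zero returns the noise-free cell values, which can be a convenient way to check that the fixed effects were entered as intended.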
Equation [4] is based on the syntax of the MODEL statement used by PROC GLM except that it includes the experimental (residual) error. This format is the most parsimonious and does not include the overall mean (µ), which is a constant and does not contribute to the variation of the dependent variable.

Fixed and Random Effects

Each effect in the linear model is considered to be either fixed or random. Determining whether an effect should be fixed or random is crucial because fixed and random effects satisfy different experimental objectives, have different scopes of inference, and lead to different statistical estimators and test statistics. Fixed effects are treatment factors that include all levels or types of the effect and are used to make narrow inferences limited to the treatments in the experiment and tests of their significance. Fixed effects are usually the treatment (explanatory) effects being investigated, such as fertility treatments, fertility rates, cultivars, and temperature levels. For fixed effects, the estimates of the treatment means and tests of significance of effects and differences between means are of primary interest.

In contrast, the levels or types of a random effect represent random samples of all possible levels and types of the effect. Random effects are used to make broad inferences about the general variation caused by the random effect that is not limited to the levels of the effect included in the experiment. Effects that are associated with years, locations, or genotypes are often considered random if they constitute a sufficiently large random sample of the defined population [see Vargas et al. (2018), Chapter 7]. For random effects, inferences about their variation are of primary interest rather than differences between means. Effects associated with design factors and error variation, such as blocks and experimental and sampling variation, should always be random.
Although it is usually readily apparent whether an effect is fixed or random, sometimes the effect does not neatly or exclusively fit the criteria for either random or fixed effects. Categorizing years and environments as fixed or random effects is particularly problematic and controversial for analyses of multi-environment experiments. The rationale and significance for categorizing years and locations for combined analysis have been explained by Moore and Dixon (2015) and in Chapter 8 by Dixon et al. (2018). Also, a comprehensive and practical review of mixed models by Yang (2010) provides criteria and other useful information for categorizing fixed and random effects. As with many statistical decisions, deciding whether effects are fixed or random may be subjective and require sound judgement based on scientific expertise and expectations of the researcher in addition to statistical considerations. When uncertain, expert advice and assistance of a statistician may be sought to ensure that the fixed and random effects are analyzed and interpreted correctly.

Fixed, Mixed, and Random Models

A linear model is categorized as fixed, random, or mixed based on the types of effects (other than the residual error) contained in the model. Fixed and random models contain all fixed or all random effects, respectively. A mixed model contains both fixed and random effects. Classical ANOVA procedures were developed for fixed and random models that are calculated using the mathematically simple method of moments and least squares estimation. Using classical least squares methods, the GLM procedure (SAS) was developed for fixed models to test significance of fixed effects and the VARCOMP procedure was developed for random models to estimate variance components. More recently, PROC MIXED was developed to properly analyze mixed models of normally distributed data and PROC GLIMMIX to analyze generalized mixed models of data that are not necessarily normally distributed.
The mixed model approach treats random and fixed effects differently and can incorporate covariance structures for fitting models (Littell et al., 2006). PROC MIXED is flexible and appropriate for fitting normally distributed dependent variables to mixed, fixed, or random models and designs with correlated errors such as repeated measures and split plots. For ANOVA of balanced, normally distributed fixed models, the PROC GLM and PROC MIXED outputs look different but the statistical results are identical. However, PROC GLM does not automatically use the correct error term for F-tests of models with random designs or treatment effects, so the MIXED procedure is recommended for most ANOVA analyses (Yang, 2010).

ANOVA COMPONENTS

Despite the new tools added to the ANOVA toolbox, the original ANOVA components are still useful for understanding ANOVA practice and procedures in a modern-day context. These components are: sources of variation (SOV), degrees of freedom (df), sums of squares (SS), mean squares (MS), F-ratios, and the probability of a greater F-value (P > F). Each component is associated with a step in the least squares calculations used to estimate variances and test significance of effects. The following is a brief explanation of each ANOVA component and its informational value.

Sources of Variation

Sources of variation provide the foundation for the ANOVA process and each SOV represents a term in the linear model. Therefore, rather than presenting the linear model of a factorial RCBD ANOVA as an equation (Eqs. [2–4]), the same effects can be shown as SOV in an ANOVA table. By including all effects (random and fixed; treatment and design) in an ANOVA table, readers and reviewers can easily and unambiguously understand and judge the validity and appropriateness of the linear model.
Correct accounting of the linear model effects is the most important component of ANOVA since every step of the ANOVA procedure depends on it. The SOV summarize the essential ingredients of an experiment including the experimental, sampling, and treatment designs.

Degrees of Freedom

After the SOV are identified, their df are calculated. The df is the number of independent differences used to calculate the ANOVA components for each SOV. The df for a SOV conveys the size and scope of an experiment and further clarifies the experimental, sampling, and treatment designs deduced from the SOV. The df in an ANOVA table are especially helpful when describing experiments with complex designs and analyses because df can be easily translated into the numbers of replications, factors, experimental units, and sampling units of the experiment, as well as the number of levels of each factor. Moreover, df provide insight into the adequacy of the statistical tests of the data. In general, error terms with few df indicate low power, poor repeatability of the F-test, and a high probability of a Type 2 error [see Chapter 1 by Garland-Campbell (2018) for a discussion of statistical errors].

In designing experiments, it is often preferable but not always practical or possible to have an equal number of observations in each cell. For experiments with unbalanced data, the df of the random SOV are not independent and should be adjusted using the Satterthwaite or Kenward-Roger methods (Spilke et al., 2005). In cases with insufficient data for valid hypothesis testing, researchers should consider confidence interval estimation and data visualization techniques (Majumder et al., 2013) or conduct additional experiments to obtain adequate data needed to meet their research objectives.

Sums of Squares

The SS partition the total variation into the variation contributed by each SOV. In general, the difference between each effect mean and the overall mean is squared and then totaled.
Given that the SOV are independent of each other, the SS of each SOV sum to the total SS. The SS are used to determine the percent of the variation accounted for by individual terms in the model and play a primary role for fitting models and determining goodness of fit. For testing significance of effects, the SS are an intermediary step for calculating mean squares.

Mean Squares

The average of a SS (SS/df) is the MS, which estimates the average variation associated with a SOV. A MS is used to construct F-ratios for testing significance of effects and to estimate the variance of a random effect. Both usages are based on the concept that a MS is a sample statistic that estimates a population parameter known as an expected mean square (EMS). In turn, an EMS is a linear function that contains one or more variance components, which are population parameters of interest. Later in the chapter, examples will be used to demonstrate the role of EMS for estimating variances and significance testing.

The MS of the error (MSE) and its derivatives play a fundamental role in descriptive statistics as measures of variation within normally distributed populations ($\sigma_e^2$, experimental error, residual variation). The MSE is in squared units and can be difficult to interpret in the context of the observations. Instead, the standard deviation (SD), calculated as $\sqrt{\mathrm{MSE}}$, is often the preferred statistic since it is in the same units as the observations. Another common statistic, the standard error (SE), also referred to as the standard error of the mean (SEM), is similar to and often confused with the SD. The key difference between these two statistics is that the SD describes the population distribution of individual samples, whereas the SE describes the population distribution of sample means. The relationship between these statistics is $\mathrm{SE} = \mathrm{SD}/\sqrt{r} = \sqrt{\mathrm{MSE}/r}$, where r = the number of replications estimating the mean.
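The sums of squares, mean squares, SD, and SE described above can be traced by hand on a small data set. The sketch below assumes a balanced CRD with three treatments and four replications; the yields are invented for illustration and are not the chapter's case study data. It checks that the treatment and error SS add up to the total SS and then derives the MSE, SD, and SE.

```python
import math

# Hypothetical yields: three treatments, four replications each (balanced CRD).
data = {
    "A": [10, 12, 11, 11],
    "B": [14, 15, 13, 14],
    "C": [17, 16, 18, 17],
}

t = len(data)                        # number of treatments
r = len(next(iter(data.values())))   # replications per treatment
all_obs = [y for obs in data.values() for y in obs]
grand_mean = sum(all_obs) / len(all_obs)
trt_means = {name: sum(obs) / r for name, obs in data.items()}

# Sums of squares: squared deviations, totaled for each source of variation.
ss_treatment = r * sum((m - grand_mean) ** 2 for m in trt_means.values())
ss_error = sum((y - trt_means[name]) ** 2 for name, obs in data.items() for y in obs)
ss_total = sum((y - grand_mean) ** 2 for y in all_obs)

# Degrees of freedom and mean squares (MS = SS / df).
df_treatment = t - 1
df_error = t * (r - 1)
ms_treatment = ss_treatment / df_treatment
mse = ss_error / df_error

sd = math.sqrt(mse)       # spread of individual observations
se = math.sqrt(mse / r)   # spread of a treatment mean based on r replications

print(f"SS: treatment = {ss_treatment:.2f}, error = {ss_error:.2f}, total = {ss_total:.2f}")
print(f"df: treatment = {df_treatment}, error = {df_error}")
print(f"MS: treatment = {ms_treatment:.3f}, error (MSE) = {mse:.3f}")
print(f"SD = {sd:.3f}, SE = {se:.3f}")
```

Because the design is balanced, the additivity check (treatment SS plus error SS equals total SS) holds exactly; with unbalanced data the partition depends on how the SS are defined.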
The SE is used to estimate the precision and confidence intervals of means and commonly is presented with the mean as $\bar{Y} \pm \mathrm{SE}$. Another related statistic is the standard error of the difference, $\mathrm{SED} = \sqrt{2 \times \mathrm{MSE}/r}$, which is used for t-tests and multiple comparison procedures such as the LSD (least significant difference) to test for significant differences between pairs of means. An LSD procedure is the same as multiple t-tests to determine whether the difference between any pair of means is significant at a chosen α level: $\mathrm{LSD}(\alpha) = t_{\alpha} \times \mathrm{SED}$. If the difference between two means is greater than or equal to the LSD value, then those two means are considered to be significantly different. Note that the LSD and multiple t-tests use the SED based on the MSE from the ANOVA.

As seen in their calculations, the SE and SED decrease as the number of replications increases. This improves the precision of mean estimation, increases the power of the test, and narrows confidence intervals. Although increasing the number of replications does not necessarily increase or decrease the MSE or SD, sufficient replication is critical for an MS to be a reliable and unbiased estimate of variation. As previously noted, based on the Central Limit Theorem, even if the population of individuals is not normally distributed, the distribution of sample means will be approximately normal if the sample size is sufficiently large. Consequently, it is tempting to use a large number of replications, but it is not always wise. In planning an experiment, it is important to determine, or at least estimate, the number of replications needed for a valid estimate of MSE and to detect meaningful differences, recognizing the real-world constraints of time and resources.

F-values

An F-statistic (also termed F-ratio or F-value) is a ratio of MS's used to compare variances.
To test the null hypothesis that the treatment variance is zero, an F-statistic is constructed that compares the Treatment MS to the Error MS. This F-statistic is analogous to a signal-to-noise ratio, where the signal is the effect and the random variation is the noise. The Treatment MS, calculated from the deviations of the treatment means from the overall mean, contains variation due to both the signal and the noise. The Error MS, calculated from the deviations of the observations from their treatment mean, contains only variation due to the noise. Thus, the basic form of the F-statistic is

$F = \dfrac{\text{signal} + \text{noise}}{\text{noise}}$

If the null hypothesis is true, the signal (effect) is zero and F reduces to one. The differences among the means are found to be significant when they are larger than would be expected from random variation. The MS's that comprise the numerator and denominator of the F-statistic are based on the variance components contained in their EMS (Table 2). Here, the F-statistic is

$F = \dfrac{\text{Treatment MS}}{\text{Error MS}}$, which is an estimate of $\dfrac{\mathrm{Var}(\text{Residual}) + r\,\mathrm{Var}(\text{Treatment})}{\mathrm{Var}(\text{Residual})}$.

The probability density (frequency) distribution of F-statistics, known as the F-distribution, is a function of the df of the numerator (ndf), the df of the denominator (ddf), and the hypothesized effect size. If the null hypothesis is true, the effect size, by definition, is equal to zero and the F-statistics follow a central F-distribution. For alternate hypotheses that propose an effect size of interest, the F-statistics follow a noncentral F-distribution that is a function of the ndf, ddf, and the chosen effect size. The effect size is often given in relative terms as the noncentrality parameter (λ), calculated as the Treatment SS/Error MS. The critical F-value is the point on the central F-distribution at which P > F equals the chosen significance level.
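Both the p-value of a calculated F-statistic and the critical F-value for a chosen significance level can be obtained numerically from the central F-distribution. The sketch below is an illustration rather than the chapter's method (the chapter's analyses use SAS). It assumes only the Python standard library, evaluates the regularized incomplete beta function with a standard continued-fraction expansion, and locates critical values by bisection; the function names `f_pvalue` and `f_critical` are hypothetical.

```python
import math


def _beta_cf(a, b, x, max_iter=300, eps=1e-14):
    """Continued fraction for the incomplete beta function (modified Lentz method)."""
    tiny = 1e-300
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        # Even step of the recurrence.
        aa = m * (b - m) * x / ((qam + m2) * (a + m2))
        d = 1.0 + aa * d
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = 1.0 + aa / c
        c = c if abs(c) > tiny else tiny
        h *= d * c
        # Odd step of the recurrence.
        aa = -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))
        d = 1.0 + aa * d
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = 1.0 + aa / c
        c = c if abs(c) > tiny else tiny
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def betainc(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    front = math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                     + a * math.log(x) + b * math.log1p(-x))
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _beta_cf(a, b, x) / a
    return 1.0 - front * _beta_cf(b, a, 1.0 - x) / b


def f_pvalue(f, ndf, ddf):
    """Upper-tail probability P(F > f) under the central F-distribution."""
    return betainc(ddf / 2.0, ndf / 2.0, ddf / (ddf + ndf * f))


def f_critical(alpha, ndf, ddf):
    """Critical F-value: the point where the upper-tail probability equals alpha."""
    lo, hi = 0.0, 1.0
    while f_pvalue(hi, ndf, ddf) > alpha:   # expand until the tail is small enough
        hi *= 2.0
    for _ in range(100):                    # bisection on a monotone function
        mid = 0.5 * (lo + hi)
        if f_pvalue(mid, ndf, ddf) > alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Example: critical values for 1 numerator df and 20 denominator df.
for alpha in (0.10, 0.05, 0.01):
    print(f"alpha = {alpha:<5} critical F(1, 20) = {f_critical(alpha, 1, 20):.2f}")
```

A statistical package would normally supply these functions; writing them out here only makes explicit that a p-value is an upper-tail area of the central F-distribution and that a critical value is the inverse of that relationship.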
Table 2. Expected mean squares of RCBD experiment with factors A and B from SAS (PROC MIXED Method=Type3).

            Fixed Model (A Fixed, B Fixed)                Mixed Model (A Random, B Fixed)
Source      Expected Mean Square          Error Term      Expected Mean Square                      Error Term
Blk         Var(Residual) + tVar(Blk)†    MS(Residual)    Var(Residual) + tVar(Blk)
A           Var(Residual) + Q(A,A×B)      MS(Residual)    Var(Residual) + rVar(A×B) + rbVar(A)      MS(A×B)
B           Var(Residual) + Q(B,A×B)      MS(Residual)    Var(Residual) + rVar(A×B) + Q(B)          MS(A×B)
A×B         Var(Residual) + Q(A×B)        MS(Residual)    Var(Residual) + rVar(A×B)                 MS(Residual)
Residual    Var(Residual)                 MS(Residual)    Var(Residual)

† Var denotes the variance of a random effect and Q a quadratic function of fixed effects. r = number of replications, t = number of treatments, and b = number of levels of B.

Table 3. Critical F-values for different degrees of freedom for p-values ≤ 0.01 and ≤ 0.05.

Critical F-values, P ≤ 0.01
                          Numerator df
Den df       1      2      3      4      5     10     20
2         99.0
3         34.1   30.8
4         21.2   18.0   16.7
5         16.3   13.3   12.1   11.4
10        10.0    7.6    6.5    6.0    5.6
20         8.1    5.8    4.9    4.4    4.1    3.4
120        6.8    4.8    3.9    3.5    3.2    2.5    2.0

Critical F-values, P ≤ 0.05
                          Numerator df
Den df       1      2      3      4      5     10     20
2         19.0
3         10.1    9.5
4          7.7    6.9    6.6
5          6.6    5.8    5.4    5.2
10         5.0    4.1    3.7    3.5    3.3
20         4.3    3.5    3.1    2.9    2.7    2.3
120        3.9    3.1    2.7    2.5    2.3    1.9    1.3

Prior to computers, researchers relied on F-tables containing critical F-values for F-distributions of varying ndf and ddf for common significance levels (e.g., 0.10, 0.05, 0.01). F-tables are relics of the past but are still useful for understanding the relationships between df, F-values, and p-values. For example, Table 3 shows the impact that ndf and ddf have on the critical F-values. Critical F-values with only 1,2 df are very large. However, critical F-values decrease sharply as the ddf increase, becoming relatively stable and reliable at five ddf for α = 0.05 and ten ddf for α = 0.01.
When interpreting tests of significance, it is important to recognize that a MS estimated with few df tends to be inaccurate and imprecise, resulting in large critical F-values and F-tests lacking repeatability and power. It is especially important to recognize that some tests of effects within experiments with multiple error terms, such as split-plot experiments and experiments combined over a limited number of environments or years, often have few ddf and a higher risk of a Type 2 error. There are related concerns about the precision of estimates of variance components of random effects, and it is recommended to analyze random effects having fewer than 10 levels as fixed effects (Piepho et al., 2003; Yang, 2010) [See Chapter 7 (Vargas et al., 2018) for a detailed discussion on levels of random effects]. On the other hand, F-values based on sufficient df are robust, and minor deviations from normality or homogeneity of variance have a negligible impact on F-distributions and do not invalidate F-tests (Acutis et al., 2012). For experiments with heterogeneous error variances, mixed model approaches based on maximum likelihood estimates that can detect and fit different covariance structures are recommended.

The influence of the ndf and ddf on the shape and spread of the F-distribution is illustrated in Fig. 1, where the central and noncentral F-distributions are shown for small (ndf = 1, ddf = 4), medium (ndf = 5, ddf = 20), and large (ndf = 20, ddf = 100) experiments. For the small experiment (ddf = 4), the F-distribution is highly skewed towards 0 and displays extreme kurtosis. As the df increase, as seen for the medium (ddf = 20) and large (ddf = 100) experiments, the central F-distribution becomes less dispersed with a higher density of F-values near the true F = 1. Similarly, the noncentral F-distribution also has a higher density and is less dispersed as the df increase.
When both the central and noncentral distribution curves are less dispersed, their overlap is also less. This translates into a more powerful F-test. The overlap between the F-distributions of the null and alternate hypotheses also depends on the effect size. In general, as the effect size of the alternate hypothesis increases, the noncentral F-distribution shifts to the right and becomes more dispersed, and the overlap between the null and alternate F-distributions decreases (not shown). The relevance of the overlap between the central and noncentral F-distributions to hypothesis testing is discussed in Chapter 1 (Garland-Campbell, 2018) and later in this chapter.

Although not common practice, F-values can be used to assess and compare the relative magnitude of fixed effects (McIntosh, 2015). If an effect is null, the variance component of the effect is 0, the numerator and denominator estimate the same expected mean square value, and the theoretical F (“true” F) is 1. Therefore, F minus 1 (F − 1) estimates the magnitude of the variance due solely to the effect relative to its error variance. Since (F − 1) increases as the effect size and variance component increase, the F-value calculated from MS with adequate df should increase as the size of the “true” effect increases. Thus, a ratio of F-values can be used as a simple and quick tool to compare the magnitudes of effects and to indicate their relative importance. As an informal and exploratory statistic, this “ratio of F-ratios” provides a rudimentary quantitative assessment of the relative magnitudes of effects to augment the qualitative tests of effect significance. The “ratio of F-ratios” for comparing a main effect to an interaction, calculated as F(main effect)/F(interaction), can help the researcher to decide whether to conduct post-planned comparisons between main effect or interaction means.
This convenient ratio does not have the drawbacks noted by Saville (2015) from relying on the significance of the interaction to determine whether to conduct an LSD to compare means.

Fig. 1. Central (solid lines) and noncentral (dashed lines) F-distributions (λ = 10) for selected numerator and denominator degrees of freedom.

P-value

A p-value is the probability of a test statistic (F, t, or χ²) equal to or greater than the calculated value, given that the null hypothesis is true. For ANOVA, p-values for the calculated F-statistics are determined using the appropriate central F-distribution. An appropriately small p-value is chosen to guard against incorrectly rejecting a null hypothesis and to ensure scientific credibility when claiming differences between means. During the early development of ANOVA, Fisher recommended using p ≤ 0.05 as a reasonable and convenient p-value to determine statistical significance, which has since become a rote practice. Also by common convention, p-values of 0.05, 0.01, and 0.001 are designated as significant (*), very significant (**), and highly significant (***), respectively.

An alternate approach to significance testing, championed by Neyman, considers two types of statistical errors (Neyman and Tokarska, 1936). Incorrectly rejecting a null hypothesis and falsely declaring a significant difference is a Type 1 error. A second type of error, a Type 2 error, occurs when a false null hypothesis is accepted and the effect is not found to be significant. The test of significance uses a p-value from the central F-distribution to place a fixed limit on the Type 1 error rate (α), whereas the Type 2 error rate (β) is based on the cumulative probability from the noncentral F-distribution of the alternate hypothesis, which is the area to the left of the critical F-value. Thus, α and β are inversely related, and reducing the Type 1 error rate (decreasing the p-value used to determine significance) will increase the Type 2 error rate.
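The trade-off between α and β can be examined by simulation. The sketch below is illustrative only and is not part of the chapter's analyses. It assumes ndf = 1, ddf = 20, and a noncentrality parameter λ = 10 (the value used in Fig. 1), draws F-statistics from the central and noncentral F-distributions using standard normal deviates, and uses 4.35 and 8.10 as the critical F-values for α = 0.05 and 0.01 (listed to one decimal as 4.3 and 8.1 in Table 3). The function names, seed, and number of simulations are arbitrary assumptions.

```python
import random


def simulate_f(ndf, ddf, lam, n_sim, rng):
    """Draw F-statistics; lam is the noncentrality parameter (0 gives the central F)."""
    shift = lam ** 0.5   # place all of the noncentrality on one numerator term
    values = []
    for _ in range(n_sim):
        numerator = (rng.gauss(0.0, 1.0) + shift) ** 2
        numerator += sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(ndf - 1))
        denominator = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(ddf))
        values.append((numerator / ndf) / (denominator / ddf))
    return values


def error_rates(null_f, alt_f, f_crit):
    """Estimate alpha (null rejected when true) and beta (null kept when false)."""
    alpha_hat = sum(f > f_crit for f in null_f) / len(null_f)
    beta_hat = sum(f <= f_crit for f in alt_f) / len(alt_f)
    return alpha_hat, beta_hat


rng = random.Random(2018)                  # arbitrary seed for repeatability
NDF, DDF, LAMBDA, N_SIM = 1, 20, 10.0, 20000
null_f = simulate_f(NDF, DDF, 0.0, N_SIM, rng)      # null hypothesis true
alt_f = simulate_f(NDF, DDF, LAMBDA, N_SIM, rng)    # alternate hypothesis true

CRITICAL_F = {0.05: 4.35, 0.01: 8.10}      # critical F(1, 20) at each alpha
rates = {alpha: error_rates(null_f, alt_f, f_crit)
         for alpha, f_crit in CRITICAL_F.items()}

for alpha, (alpha_hat, beta_hat) in rates.items():
    print(f"alpha = {alpha}: simulated Type 1 rate = {alpha_hat:.3f}, "
          f"simulated Type 2 rate = {beta_hat:.3f}")
```

Because the same simulated F-statistics are compared with both critical values, moving the critical value to the right (a smaller α) can only increase the share of noncentral F-statistics that fall to its left, which is the estimated β.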
It is important that researchers choose a Type 1 error rate (or significance level) that also balances the relative risks of Type 2 errors. However, since β is unknown and changes with sample size, error variance, and effect size, the de facto significance level for α is 0.05, not coincidentally the same as suggested by Fisher (Lehmann, 1993). In Chapter 1, Garland-Campbell (2018) provides a thorough discussion on this topic.

Fisher’s p-values and Neyman’s α-levels represent rival statistical philosophies regarding significance testing. Fisher’s focus was on scientific inquiry and using significance tests to help the researcher draw conclusions to understand and learn from the experimental data. In contrast, Neyman focused on making correct decisions about rejecting or accepting the null hypothesis in relation to the relative seriousness of Type 1 and Type 2 errors. These two conceptual views of significance, similar yet different, are commonly conflated and used interchangeably. This conflation has contributed to misconceptions and misapplications of p-values (Nuzzo, 2014).

Regardless of their shortcomings, p-values are ubiquitous throughout the scientific literature. They are used as the universal statistic to convey confidence in the conclusions about the experimental results. Scientific journals have grown to rely on p-values as the deciding factor that separates scientific evidence from anecdote. Meanwhile, statisticians have become increasingly concerned about the impact that frequent misuse and misinterpretation of p-values have on scientific integrity and progress. The long-standing debate over the proper role of p-values has become increasingly heated, prompting the American Statistical Association to issue a policy statement with supplemental commentaries to address major concerns associated with the uses of p-values (Wasserstein and Lazar, 2016). The following is a list of fundamental characteristics of p-values that are often overlooked.

1.
A p-value is an inferential statistic that estimates the parameter P (the true probability). Just as treatment means are calculated from samples of populations to estimate the population mean, p-values are estimated using the distribution of the samples that represent the population distribution. In fact, p-values are subject to variation that can be surprisingly and disappointingly large. Based on simulations of typical data situations, standard errors of mean p-values between 0.00001 and 0.10 typically range from 10–50% of the mean, and only the magnitude of a p-value is well-determined (Boos and Stefanski, 2011). Thus, using a strict cutoff such as 0.05 as a dividing line for significance is problematic since a p-value is not necessarily replicable (Amrhein et al., 2017).

2. P-values are random variables, and the distribution of p-values for the null and alternate hypotheses depends on the true values of the treatment means (µ). This is illustrated (Fig. 2) with histograms of simulated p-values generated from t-tests where the true means are equal (µ = 0), differ by half the SD (µ = 0.5), and differ by one SD (µ = 1) (Murdoch et al., 2008). If the null hypothesis is true (treatment means are equal), the distribution of p-values is flat and evenly distributed from 0 to 1. In contrast, if the alternate hypothesis is true (the treatment means are not equal), the distribution of p-values is skewed toward 0. As the difference between the means increases, the p-values cluster nearer to 0, showing that the power of the t-test increases as the difference between means increases.

Fig. 2. Distributions of p-values for the null hypothesis (µ = 0) and alternate hypotheses (µ = 0.5 and µ = 1) based on 10,000 simulated t-tests (Murdoch et al., 2008).

3. Small p-values, those p-values that confer significance, are the least reliable.
F-distributions when the null hypothesis is true are highly skewed, resulting in p-values being quite insensitive at the tail end of the F-distribution. This is illustrated in the nonlinear relationship between F and p-values shown in Fig. 3 for F-values with 1,20 df. This is also demonstrated by comparing the differences between the critical F-values at different p-values in Table 4. For example, the difference between the critical F-values at the 0.10 and 0.05 significance levels is small (1.38) compared with the large difference (10.87) between the critical F-values for significance at the 0.0010 and 0.0001 significance levels.

Fig. 3. P-value vs. F-value for 1,20 degrees of freedom.

Table 4. Critical F-values at p-values ranging from 0.1 to 0.0001.

                  Critical F-value
P-value      1,2 df    1,20 df    24,96 df
0.1             8.5        3.0         1.5
0.05           18.5        4.3         1.6
0.01           98.5        8.1         2.0
0.001         998.5       14.8         2.5
0.0001       9998.5       23.4         2.9

4. The p-value for a given F-value decreases as sample size increases (Fig. 4). If sample size is small, large differences may not be statistically significant, producing a Type 2 error. Conversely, if the sample size is very large, even trivial effects can be statistically significant, which can be misinterpreted to infer biological significance.

5. A small p-value is used as a criterion to reject (disprove) a null hypothesis. If a p-value ≤ 0.05 is chosen as the criterion to reject the null hypothesis and the p-value > 0.05, the null hypothesis is accepted (not rejected). However, a p-value > 0.05 is too often misinterpreted as the probability that the null hypothesis is true, or the p-value is taken to be the probability that the alternative hypothesis is true.

6. P-values are quantitative statistics often transformed into a binary measure of significance.
Although this has been defended as a safeguard against bias and subjectivity, it can create a cascade of bad decisions because statistical significance often determines if a study is publishable and sets a path for future research (Mervis, 2014). In fact, authors often do not write up nonsignificant findings, creating a publication bias that increases the probability of experiment-wise Type 1 errors and inflates effect sizes (Franco et al., 2014). According to the American Statistical Association, “The widespread use of “statistical significance” (generally interpreted as “p ≤ 0.05”) as a license for making a claim of a scientific finding (or implied truth) leads to considerable distortion of the scientific process” (Wasserstein and Lazar, 2016).

7. P-values are essentially irreproducible (Amrhein et al., 2017). There are countless possible experiments that could be designed to test a hypothesis, understand a phenomenon, or determine and predict effects of treatments. Likewise, there are numerous ways to statistically analyze experimental data. The best experimental design and analysis is always an unknown and can only be proposed based on the existing evidence, experimental objectives, and practical constraints. With so many possible permutations, there is no universal solution. Instead, we must still rely on researchers and statisticians to use informed judgement when designing and conducting an experiment and recognize that scientific advancement is usually the product of the accumulation of experimental evidence rather than the result of a single experiment.

Fig. 4. P-value vs. F for varying numerator and denominator degrees of freedom.

8. P-values are sometimes used inappropriately to conduct meta-analyses across experiments.
P-values should not be used to compare treatment effects across experiments because using significance levels rather than direct comparisons leads to flawed conclusions caused by ignoring the effect size and the power of the test (Gelman and Stern, 2006).

Contrasts and Multiple Comparison Procedures

In addition to using ANOVA to test the significance of effects in the linear model, ANOVA can also test more in-depth and narrowly focused hypotheses of interest. Contrasts and multiple comparison procedures can perform additional tests of significance using an error term from the ANOVA. Contrasts are constructed to test specific differences between and among means, groups of means, or polynomial relationships among means. Contrasts are usually planned comparisons that are conceived as part of the experimental planning process to investigate treatment effects at a fine level. As an alternative to contrasts, multiple comparison procedures are used to test for significance of differences between multiple pairs of treatment means. These procedures are most suitable for qualitative and unstructured treatments (cultivars, chemical formulations, soil types, etc.) and are used to determine the best treatment(s) and rank treatments into clusters. Saville (2018) in Chapter 5 provides a critique of multiple comparison procedures. He recommends that if a multiple comparison procedure is justified, the best choice is an unrestricted LSD, which is equivalent to conducting t-tests for all possible treatment pairs.

Case Study: The Story of Statbean: From Discovery to Field Testing

Introduction

This case study provides the context for an example for readers to practice conducting and interpreting an ANOVA (the data, the SAS and R code for the analyses of all dependent variables, and the SAS outputs for the example are in the supplemental documentation). Hopefully, readers will share some of Dr. Y’s enthusiasm for research.
And by working with her data, you will appreciate how ANOVA can help organize and summarize the observations into useful information. The purpose of the example is not to provide a recipe for ANOVA but to illustrate a thought process and rationale that drives the ANOVA based on both contemporary and classic ANOVA theory. The example is intentionally framed to be of general interest and void of statistical complications. The analysis and results presented are also meant to serve as a platform for discussion. Readers who perform different analyses with the sample data can compare their own results to those from the analysis of the example. The purpose of the example is to show not how, nor what, but why.

Dr. Y was beginning her career as an agronomist in search of developing alternative crops to improve sustainability through genetic, economic, and commodity diversification. When she read that the indigenous people of Lala Land used a native herb to enhance performance for their traditional equation-solving competition, she was intrigued. Did this plant really have bioactive properties that improved brain function? Could this species be cultivated as a medicinal plant? If so, could this species become a new crop and open new opportunities for farmers and gardeners to grow and market as a herbal supplement? Her curiosity about this plant was intense as she envisioned herself leading the multidisciplinary research needed to develop a new and valuable alternative crop.

Y was awarded a “seed” grant from WOMEN (World Organization of Mathematicians Exploring Nutraceuticals) to go to Lala Land to investigate and obtain samples of this promising medicinal herb. She traveled to Lala Land, where leaders from the local population taught her how to grow plants to be formulated into a tonic. They gave her seed and tonic from this plant species, which she named statbean (Plantus statisticus).
In exchange, she taught them how to design and statistically analyze field trials using ANOVA and promised them a share of the future profits from the production or germplasm development of statbean. Upon returning home to Reality, Y conducted efficacy trials with volunteers from college statistics classes and found that the tonic significantly increased student ability to solve equations for up to three hours. Eureka! Using data demonstrating the benefit of statbean tonic, Y received funds from the State of Reality Experiment Station to investigate the potential to cultivate statbean in Reality.

Research Objectives

A field study was conducted to determine the effects of soil calcium and mulch on the establishment of statbean in the State of Reality. This project consisted of an experiment that was replicated at three locations. The objectives and challenges of these three statbean field experiments were similar to those encountered by the early agronomists conducting yield trials on crops to advise farmers on the effects of fertilizers and manures on crop yields. When Y conducted her experiments, she was able to benefit from a century of advancements in statistics to design and analyze her research to be confident that her results were repeatable, reliable, and valid.

Experimental Description and Design–Randomization and Replication of Treatments

Y established statbean research plots at three locations, the Western, Central, and Eastern Reality Research and Education Centers. These three research centers were chosen to represent the growing conditions of the western, central, and eastern regions of Reality. At each location, 10 treatments were replicated in a RCBD with three blocks. The treatment design was a 5 × 2 factorial consisting of five Ca treatments (Ca_Trt: control (0), lime 1X (L1X), lime 2X (L2X), gypsum 1X (G1X), gypsum 2X (G2X)) and two mulch treatments (with and without mulch).
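The randomization implied by this design can be sketched in a few lines. The example below is an illustration under stated assumptions, not the chapter's procedure (the chapter's plot plan was generated in SAS): each of the 10 factorial treatment combinations is assigned at random to one plot within every block at every location. The function name `plot_plan`, the seed, and the level labels used as identifiers are hypothetical.

```python
import random
from itertools import product

CA_TRT = ["control", "L1X", "L2X", "G1X", "G2X"]   # five Ca treatments
MULCH = ["no", "yes"]                              # two mulch treatments
LOCATIONS = ["Western", "Central", "Eastern"]
N_BLOCKS = 3


def plot_plan(seed):
    """Randomize the 5 x 2 factorial within each block at each location (RCBD)."""
    rng = random.Random(seed)
    treatments = list(product(CA_TRT, MULCH))      # 10 treatment combinations
    plan = []
    for loc in LOCATIONS:
        for blk in range(1, N_BLOCKS + 1):
            order = treatments[:]                  # every block gets all treatments
            rng.shuffle(order)                     # independent shuffle per block
            for plot, (ca, mulch) in enumerate(order, start=1):
                plan.append({"loc": loc, "blk": blk, "plot": plot,
                             "Ca_Trt": ca, "Mulch": mulch})
    return plan


plan = plot_plan(seed=2018)                        # arbitrary seed, for repeatability
for row in plan[:10]:                              # first block at the first location
    print(row)
```

Fixing the seed lets the same plan be regenerated later, and shuffling separately within each block is what makes the design a randomized complete block rather than a completely randomized design.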
The lime (CaCO3) and gypsum (CaSO4) treatments both added Ca to the soil at two equivalent rates (1X and 2X). Lime also increases soil pH. Thus, the effect of lime on plant establishment confounds the effects of Ca with pH. Because gypsum does not increase soil pH, the gypsum treatments were used to separate soil Ca and soil pH effects on plant establishment. The location and treatment factors and interactions were considered fixed. A summary of the experiment is shown in Table 5.

The field plot plan for the 2 × 5 factorial randomized in three blocks was generated for the three locations using PROC PLAN of SAS (Table 6). The Ca and mulch treatments (labeled 1–10) were applied to the field plots (experimental units), which were then seeded with 100 statbeans per plot.

Sampling Description and Design–Measuring Dependent Variables

The primary objective was to investigate the effects of selected soil treatments on statbean production in three locations. Plant establishment (Ptotal) was the dependent variable used as an indirect measure of production. The Ptotal was calculated as the number of plants per 100 seeds sown per plot. Soil pH and Ca concentration were also measured as dependent variables. The lime Ca treatments were used to increase both soil pH and soil Ca, while the gypsum Ca treatments were used to increase soil Ca without changing soil pH. Thus, the ANOVAs of soil pH and soil Ca were conducted to verify and quantify the direct Ca treatment effects and interactions on the soil. Composite soil samples of 6 cores per plot were analyzed for pH and Ca concentration.

Preparing, Correcting, and Knowing the Data

Before using the ANOVA results, the data were scrutinized to verify that the correct data and model were being analyzed.
“Garbage in, garbage out” is a familiar warning that you can trust the computer program to perform the math but the result will be garbage if the data or programming is incorrect. Regardless of what tools or whose assistance is used to perform the ANOVA, it is the researcher who is responsible for the integrity of the results. It is also a good idea to spend time learning about and from your data before conducting an ANOVA. To do this, simple descriptive statistics and diagnostic plots can be useful. By assessing the data, the researcher can identify and resolve data issues and preview the means to be compared.

Table 5. Summary of the experiment at each location.

Linear Model – Y = Blk + Ca_Trt + Mulch + Ca_Trt × Mulch + Error
Experimental design – RCBD, 3 blocks
Treatment design – 2 × 5 factorial, 10 treatments
Factors – Calcium Treatments (Ca_Trt) and Mulch Treatments (Mulch)
Ca_Trt levels – Control, 1X Lime, 2X Lime, 1X Gypsum, 2X Gypsum
Mulch levels – no mulch, mulch
Experimental Unit – 3 m × 3 m field plot planted with 100 seed
Dependent variables – Soil pH, Soil Ca, plant establishment (PTotal)
PTotal Sampling Unit – plant count/100
Soil pH and soil Ca Sampling Unit – composite of 6 soil samples/plot

Table 6. SAS code and output of plot plan for a 2×5 factorial RCBD randomized at three locations.†

    proc plan seed=101420171;
      factors Loc=3 ordered Blk=3 ordered trt_no=10;
    run;

† Seed number used to allow plot plan to be duplicated.

Fig. 5. Box plots of pH at Central, East, and West locations using JMP 11.1 Pro.

Descriptive Summary of the Data

Summary tables, plots, and histograms of the soil pH measurements were used to learn the pH range and distribution, discover patterns in the data, and even guesstimate whether means were significantly different. The summary table (Table 7) shows that there were no missing (N) or out-of-range values (Min, Max). The pH means were
slightly higher for the lime than the control or gypsum treatments, while the mulch did not appear to affect soil pH. However, the standard deviations in the summary table were based on too few observations (N = 3) to be used to determine if the differences between means were due to the treatments or mere random variation. As previously noted, an estimate of error variance based on fewer than five replications is not reliable.

Box plots provide a visual summary of the pH treatment means and distribution at each location (Fig. 5). The differences among pH treatment means within locations appear small and their distributions overlap, whereas the pH means show large differences between locations and the pH distributions do not overlap. Just as a picture is worth a thousand words, box plots offer an intuitive understanding of the pH data. In contrast, summary tables provide values and require more time to assess but are more precise and can be used for rigorous statistical analysis. (Note to readers: In this chapter, JMP 11 was used to explore the data for errors and outliers. To see an example of data exploration using R software packages, refer to Chapter 14 on Multivariate Methods by Yeater and Villamil, 2018.)

ANOVA by Location

The ANOVA was conducted first for each location as a separate experiment and then as one experiment combined over locations. The ANOVA by location is used to make inferences about the mulch and Ca treatment effects and interactions. A separate ANOVA at each location avoids issues related to heterogeneity of error variances across locations that occur when using a pooled residual error averaged over locations. However, the ANOVA combined over locations has an expanded model that includes location effects and interactions. For our example, the analyses of soil pH at the Central location and combined over locations are shown using the linear models
Summary table of soil pH data at three locations. Location Central Mulch Ca_Trt no yes East West N Mean Std Dev Min Max N Mean Std Dev Min Max N Mean Std Dev Min Max control 3 5.4 0.1 5.3 5.6 3 4.0 0.2 3.9 4.2 3 6.6 0.6 5.9 7.1 G1X 3 5.5 0.4 5.2 5.9 3 4.0 0.1 3.9 4.1 3 6.6 0.4 6.2 6.9 G2X 3 5.7 0.2 5.6 5.9 3 4.0 0.2 3.9 4.2 3 6.5 0.8 5.7 7.1 L1X 3 6.0 0.4 5.6 6.3 3 4.2 0.3 3.9 4.4 3 6.9 0.3 6.6 7.2 L2X 3 6.1 0.1 6.0 6.1 3 4.4 0.3 4.1 4.8 3 6.9 0.4 6.5 7.3 control 3 5.7 0.3 5.5 6.1 3 4.1 0.2 4.0 4.3 3 6.8 0.3 6.5 7.0 G1X 3 5.9 0.2 5.7 6.1 3 4.0 0.2 3.9 4.2 3 6.7 0.3 6.4 7.1 G2X 3 5.5 0.4 5.1 5.9 3 3.9 0.2 3.8 4.1 3 6.5 0.7 5.8 7.2 L1X 3 6.0 0.2 5.7 6.1 3 4.3 0.3 4.1 4.7 3 6.9 0.3 6.6 7.3 L2X 3 6.1 0.3 5.8 6.3 3 4.3 0.4 4.1 4.8 3 7.1 0.2 6.8 7.3 40 McIntosh Table 8. Linear model effects for one location and combined over locations. One Location Combined Locations Linear Model pH = Blk+ Ca_Trt + Mulch + Ca_Trt×Mulch + Error pH = Location + Blk(Location) + Ca_Trt + Mulch + Ca_Trt ×Mulch + Location×Ca_Trt + Location×Mulch + Location×Ca_Trt ×Mulch + Error Fixed Effects Ca_Trt, Mulch, Ca_Trt ×Mulch Location, Ca_Trt, Mulch, Ca_Trt ×Mulch, Location× Ca_Trt, Location×Mulch, Location× Ca_Trt ×Mulch Random Effects Blk, Error Blk(Location), Error Table 9. SAS Code – PROC MIXED for pH at each location. data anova.statbean; set anova.statbean; proc sort; by loc; run; Title ‘Statbean Data’; proc print; run; Title ‘Mixed pH ANOVA by location’; proc mixed data=anova.statbean plots=residualpanel method=type3; by loc; class Blk Mulch Ca_Trt; model pH=Mulch Ca_Trt Mulch*Ca_Trt; random Blk; lsmeans Mulch Ca_Trt Mulch*Ca_Trt; run; in Table 8. All effects in both models except for Blk(Loc) were fixed and inferences were limited to the locations and treatments included in the experiments. 
ANOVA by Location SAS code for PROC MIXED

The PROC MIXED statements to perform an ANOVA to test the significance of the effects of mulch, Ca treatments (Ca_Trt), and Ca_Trt × Mulch for soil pH at each location (Table 9) are:
i) PROC statement to invoke the MIXED procedure with options a) plots = residualpanel to request a panel of residual plots, and b) method = type3 to print a comprehensive ANOVA table with expected mean squares (EMS);
ii) BY statement to request a separate analysis for each location;
iii) CLASS statement to identify the classification (qualitative) model effects;
iv) MODEL statement to define the fixed effects to be tested for significance for one dependent variable;
v) RANDOM statement to identify the random model effects other than the residual (random error) effect; and
vi) LSMEANS statement to request least squares means and their standard errors.

Residual Plots

The pH data have been checked and found to be free of typing errors and outliers. We also need to determine whether the classical ANOVA assumptions are justified to choose the most appropriate ANOVA analyses. These assumptions can be assessed using statistical tests and/or visual interpretation of graphs. Although statistical tests may seem to be more objective and definitive, plots of residual values are more powerful and useful for identifying the cause of an assumption violation (Kozak and Piepho, 2017; Loy et al., 2016). As part of the PROC MIXED analysis, diagnostic plots of the residuals (observed − predicted) were requested to check for normality, homogeneity, and independence of error. Diagnostic plots of the residual values for pH at the Central location requested using the "plots = residualpanel" option are shown in Fig. 6. The upper left residual panel is a scatterplot of residual vs. predicted values with a reference line for residual values at zero.
If the residual values are not randomly scattered across the range of predicted values, this can indicate outliers or a violation of the classical assumption that the residuals are independent and the variation is homogeneous. If one or a few residuals are unusually distant from the reference line and most points are tightly clustered near the reference line, the distant points should be investigated to determine if they are outliers and need to be removed from the data. If the residuals form a pattern, often a cone shape, the means and variances are correlated, indicating the variances may not be independent or homogeneous. This is another reason to further investigate the data prior to analysis.

The upper right panel is a histogram of the residuals with a line referencing a normal distribution. The residuals show a nearly normal distribution. The minor deviations from the curve are expected based on the small sample size (n = 30) for the analysis of a single location.

The lower left panel is a normal probability or Q-Q plot, a widely used and powerful tool for checking for normality. The residuals are plotted against the quantiles of the normal distribution with a reference line for the normal distribution. Deviations from the line indicate deviations from the normal distribution. The Q-Q plot also shows that the distribution is approximately normal. The very minor deviations from normality at the tails of the distribution would have at most a nominal effect on the ANOVA F-tests.

The lower right panel reports the simple statistics that help spot problems in the data, as follows: i) number of observations, to check for missing values; ii) minimum and maximum residuals, to check for outliers; and iii) standard deviation, to compare error variances between analyses. The fit statistics are used to compare models and identify covariance structures, which we are assuming to be unstructured. In this example, the statistics do not reveal any outliers and the classical ANOVA assumptions appear justified.

Fig. 6. Residual panels for pH at Central location.

Table 10. PROC MIXED output for pH at the Central location using Type III estimation method.

Results

The PROC MIXED output using the Method = Type3 option for soil pH at the Central location is shown in Table 10. The Class Levels are useful for checking the class level values and their sort order. The Model Information provides important details about the statistical methods used for the mixed model analysis. For example, the standard Type III least squares method (Method = Type3 option), rather than the default REML method, was used to estimate the variances of random effects in order to obtain a comprehensive ANOVA table. The example data are balanced, so the Type III and REML estimates of variance components are the same. Also, note in the Model Information that the Covariance Structure used is Variance Components. Thus, the Covariance Parameter Estimates in the PROC MIXED output are variance component estimates of Var(Blk) and Var(Residual). These variance components comprise the EMS for the Blk and Residual SOVs. The Blk MS (0.138463) estimates Var(Residual) + 10 Var(Blk) = 0.06877 + 10(0.006969), and the Residual MS estimates Var(Residual) = 0.06877.

It is common with mixed model ANOVAs that tests of significance are important for fixed but not random effects, especially random design effects such as blocking factors. Thus, the PROC MIXED default output does not include a standard ANOVA table. Instead, a table of F-tests of fixed effects is given showing the numerator and denominator degrees of freedom (ndf and ddf) for the F-test, the calculated F-value, and the probability of a greater F-value (P > F). The comprehensive ANOVA table presents the same results for tests of fixed effects along with additional ANOVA components that provide an overview of the experimental design and enough information to assess the appropriateness and power of the analysis.
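The link between the mean squares and the variance components can be checked by hand. The following is a minimal Python sketch, not part of the chapter's SAS analysis, that solves the Blk EMS for Var(Blk) using the Blk and Residual mean squares quoted above. The helper name `variance_component` is hypothetical, and the coefficient 10 is taken to be the number of plots per block (2 mulch levels × 5 Ca treatments).

```python
def variance_component(ms_effect: float, ms_error: float, coef: float) -> float:
    """Method-of-moments estimate from the EMS:
    E[MS_effect] = Var(error) + coef * Var(effect)."""
    return (ms_effect - ms_error) / coef


# Mean squares quoted in the text for soil pH at the Central location.
BLK_MS = 0.138463
RESIDUAL_MS = 0.06877
PLOTS_PER_BLOCK = 10  # assumed: 2 mulch levels x 5 Ca treatments per block

var_blk = variance_component(BLK_MS, RESIDUAL_MS, PLOTS_PER_BLOCK)

print(f"Var(Residual) estimate: {RESIDUAL_MS:.5f}")
print(f"Var(Blk) estimate:      {var_blk:.6f}")
# Rebuilding the Blk MS from the two components should return the input value.
print(f"Rebuilt Blk MS:         {RESIDUAL_MS + PLOTS_PER_BLOCK * var_blk:.6f}")
```

With balanced data this simple subtraction should agree with the Type III estimate reported by PROC MIXED, and it shows that the block variance is roughly one tenth of the residual variance at this location.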
Whether testing hypotheses by choosing a Type 1 error rate (α = 0.05) as a significance level or using the P > F (p-value ≤ 0.05) to reject the null hypothesis, the Ca_Trt main effect was significant for soil pH but the Mulch main effect and the Mulch × Ca_Trt interaction were not significant. Thus, we infer that the Ca treatments but not the mulch treatments significantly affected the soil pH and that the Ca_Trt effect was the same for the mulched and nonmulched plots.

Planned and Multiple Pairwise Comparisons

The ANOVA tests of the significance of the linear model effects did not include tests of all hypotheses of interest for the case study or test for significant differences between individual means. Regardless of whether the Ca_Trt effect was significant, planned comparisons (contrasts) can be used to partition the Ca_Trt main effect and the Ca_Trt × Mulch interaction to delve deeper into the effects and interactions of lime and gypsum treatments. Remember that the Ca treatments added Ca to the soil in different forms (lime or gypsum) and that both lime and gypsum were expected to increase soil Ca concentration. However, the lime but not the gypsum was also expected to increase soil pH. The test of the Ca_Trt effect confirmed that the Ca treatments significantly affected soil Ca concentration (ANOVA in the online supplement) and soil pH (Table 10) at the Central location. Contrasts were then used to test whether the effects of lime and gypsum were significantly different, to determine if the response to rates of lime or gypsum was linear or quadratic, and to test for significant interactions with mulch. The SAS code and output for these contrasts are given in Table 11 for soil Ca and soil pH at the Central location. These contrasts help us interpret the variation in soil Ca and soil pH means in the context of the error variation.
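To make the idea of a planned comparison concrete, the sketch below computes two contrast estimates and their single-df F-values from the Central-location Ca_Trt means for soil pH. It is only an illustration in Python: the means are obtained by averaging the rounded mulch and no-mulch cell means in Table 7, the contrast coefficients are assumed here rather than copied from the SAS code in Table 11, and the helper names are hypothetical. Because the inputs are rounded, the results will only approximate the SAS output.

```python
# Rounded cell means (no mulch, mulch) for soil pH at the Central location, Table 7.
cell_means = {
    "control": (5.4, 5.7),
    "G1X": (5.5, 5.9),
    "G2X": (5.7, 5.5),
    "L1X": (6.0, 6.0),
    "L2X": (6.1, 6.1),
}
# Ca_Trt main-effect means: average over the two mulch levels.
trt_means = {trt: sum(pair) / len(pair) for trt, pair in cell_means.items()}

# Assumed (illustrative) contrast coefficients; each set must sum to zero.
contrasts = {
    "lime vs gypsum": {"control": 0, "G1X": -0.5, "G2X": -0.5, "L1X": 0.5, "L2X": 0.5},
    "lime linear": {"control": -1, "G1X": 0, "G2X": 0, "L1X": 0, "L2X": 1},
}

RESIDUAL_MS = 0.06877  # error MS for pH at the Central location (18 df)
N_PER_MEAN = 6         # each Ca_Trt mean averages 3 blocks x 2 mulch levels


def contrast_estimate(coefs, means):
    """Linear combination of treatment means defined by the coefficients."""
    assert abs(sum(coefs.values())) < 1e-12, "contrast coefficients must sum to zero"
    return sum(coefs[trt] * means[trt] for trt in means)


def contrast_f(coefs, means, mse, n_per_mean):
    """Single-df F-value: squared estimate divided by its estimated variance."""
    estimate = contrast_estimate(coefs, means)
    variance = mse * sum(c ** 2 for c in coefs.values()) / n_per_mean
    return estimate ** 2 / variance


for name, coefs in contrasts.items():
    est = contrast_estimate(coefs, trt_means)
    f_value = contrast_f(coefs, trt_means, RESIDUAL_MS, N_PER_MEAN)
    print(f"{name}: estimate = {est:.2f} pH units, F(1, 18) = {f_value:.1f}")
```

Each F-value is compared with the F distribution on 1 and 18 df, the same error df used for the other fixed effects at a single location.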
The contrasts confirmed that: lime significantly linearly increased both soil Ca and soil pH; gypsum did not significantly affect soil pH or soil Ca; and the effects of lime and gypsum on soil pH, but not soil Ca, were significantly different. Even though gypsum adds Ca to the soil, the increase in soil Ca concentration was not significant. This is probably due to a Type 2 error because of a lack of power of the test.

In addition to understanding the basic soil responses to the lime and gypsum treatments, the treatment means of the soil pH and Ca concentration are themselves important. PROC MIXED gives the least squares means and standard errors for the fixed effects named in the LSMEANS statement. However, since only the Ca_Trt main effect was significant, only the Ca_Trt means and LSD(0.05) are shown in a bar graph (Fig. 7). The LSD bar, centered around each mean, serves as a confidence interval. Thus, if the LSD bars of two means do not overlap, the means are significantly different at the 0.05 level. For the untreated plots at the Central location, the mean soil pH was 5.6 and the mean soil Ca concentration was 622 mg kg⁻¹. It can be seen from the graphs that the pH of the lime treatments was significantly higher than the control but the difference between the 1X and 2X rates was not significant. As expected, the mean pH values of the 1X and the 2X gypsum treatments did not differ significantly from the control or from each other. Similar to the soil pH, the mean soil Ca concentrations of the lime treatments were significantly higher than the control but not different from each other, while the differences due to gypsum treatments were not significant.

Although the LSD procedure is often used to assign letters to means to indicate significant differences between means, the LSD bars can be more informative because they offer a quantitative measure of precision associated with the differences between means. For example, the soil Ca graph shows an LSD value of 400 mg kg⁻¹ Ca, which is a large value in comparison with the mean. This indicates that the precision may be considered inadequate for the research objective.

The results of the planned and multiple comparisons answer some questions but also raise others, such as: What difference in soil Ca is of practical importance? Were the number of replications or cores sampled too few, or error control inadequate, for meaningful differences in soil Ca to also be deemed statistically different? Would higher rates of gypsum result in significant increases in soil Ca? Is there some unknown reason why the gypsum treatments did not increase the soil Ca concentration? So far, these are interesting results to be interpreted as a piece of an unfolding puzzle. To learn more about the analysis and interpretation of factorial experiments, readers are encouraged to read Chapter 7 by Vargas et al. (2018).

Fig. 7. Bar graph of Ca_Trt means and LSD(0.05) for soil pH and soil Ca concentrations at the Central location. The LSD bar is centered around each mean and the means are significantly different if their LSD bars do not overlap. The LSD(0.05) for pH = 0.32; LSD(0.05) for Ca concentration = 400 mg kg⁻¹.

Table 11. SAS code and output for contrasts testing the effects of lime and gypsum Ca treatments (Ca_Trt) on soil pH and soil Ca at the Central location.

ANOVA Combined Over Fixed Locations

The linear model for ANOVA combined over locations (Table 8) adds terms for the location main effect and interactions, allowing tests of significant differences between and across locations. Location was considered a fixed effect because too few locations were studied to be considered as a representative random sample of all locations in the State of Reality (Piepho et al., 2003). Thus, inferences about location were narrow and limited in scope to the three locations studied.
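Before moving on to the combined analysis, the LSD(0.05) quoted for soil pH at the Central location can be checked by hand. The sketch below is a Python illustration under stated assumptions, not the chapter's own calculation: it uses the usual formula LSD = t × sqrt(2 × MSE / n), the Residual MS of 0.06877 with 18 error df, n = 6 plots per Ca_Trt mean (3 blocks × 2 mulch levels), and a t critical value of 2.101 taken from standard tables. The function name `lsd` is hypothetical.

```python
import math


def lsd(t_crit: float, mse: float, n_per_mean: int) -> float:
    """Least significant difference between two means,
    each based on n_per_mean observations."""
    return t_crit * math.sqrt(2.0 * mse / n_per_mean)


# Central location, soil pH.
RESIDUAL_MS = 0.06877  # error mean square, 18 df
N_PER_MEAN = 6         # assumed: 3 blocks x 2 mulch levels per Ca_Trt mean
T_CRIT_18DF = 2.101    # two-sided 5% critical value of Student's t, 18 df (table value)

lsd_ph_central = lsd(T_CRIT_18DF, RESIDUAL_MS, N_PER_MEAN)
print(f"LSD(0.05) for soil pH, Central location: {lsd_ph_central:.2f}")
```

Under these assumptions the result rounds to 0.32, the value given with Fig. 7. The same function can be reused with a different error mean square, error df, and critical value to gauge how much extra replication would be needed to reach a target LSD.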
SAS Code – PROC MIXED pH Combined Over Locations

The revised PROC MIXED statements for the combined ANOVA are given in Table 12. Because the model included numerous factors and interactions, the SAS bar operator (|) was used in the MODEL statement. The shortcut notation Loc|Mulch|Ca_Trt substituted for the Loc, Mulch, and Ca_Trt main effects and all their possible interactions, which is not only convenient but also safeguards against inadvertently overlooking interactions.

Checking for Heterogeneity of Variance

We have already assessed the residual plots for each location and deemed that the classical ANOVA assumptions were appropriate. Before using the combined ANOVA, we also need to assess the residual plots generated from fitting the combined linear model. One common concern is that the error variances may not be the same at all locations, violating the assumption of homogeneity of variance and indicating that the pooled error term (Residual MS) may not be appropriate for tests of significance. The residual error MS values obtained from the ANOVA at each location (not shown) were 0.068, 0.040, and 0.016. Using Hartley's Fmax, the ratio of the largest to the smallest error MS, as a rough measure of heterogeneity of variance, we note a fourfold difference between the highest and lowest error variances (Hartley's Fmax = 4.38), which indicates a need to further investigate the nature and extent of possible error heterogeneity. The good news is that the residual panels from the combined analysis did not reveal any serious data problems (Fig. 8). Although the residuals near the predicted pH of 6.0 are more dispersed than those at the low and high predicted pH values, this pattern is slight. Evidence of either heterogeneity or non-normality of the residuals, which can affect the p-values of F-tests, was not seen in the residual panels.
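Hartley's Fmax is simple enough to compute directly. The following Python sketch, with a hypothetical helper name, applies it to the three rounded error mean squares listed above. Because the inputs are rounded to three decimals, the ratio comes out near 4.25 rather than the 4.38 quoted in the text, which was presumably calculated from the unrounded mean squares.

```python
def hartley_fmax(variances):
    """Hartley's Fmax: ratio of the largest to the smallest variance estimate."""
    if not variances or min(variances) <= 0:
        raise ValueError("variance estimates must be positive")
    return max(variances) / min(variances)


# Residual error MS from the separate ANOVA at each location (rounded values).
error_ms_by_location = [0.068, 0.040, 0.016]

fmax = hartley_fmax(error_ms_by_location)
print(f"Hartley's Fmax from rounded error MS: {fmax:.2f}")
```

A formal test would compare this ratio with tabulated Fmax critical values for three variances, each with 18 df; here it is used only as a rough flag that the residual plots deserve a closer look.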
ANOVA Results – Combined over Locations

The PROC MIXED results of the ANOVA combined over locations for soil pH include estimates of the random variance components, tests of significance for fixed effects, and an ANOVA table with the EMS used to construct the Type III tests of significance (Table 13). The combined ANOVA has two random error variance components. The Blk(Loc) is the average of the Blk effects and the Residual is the errors pooled over locations. The Blk(Loc) variance component estimate is 0.076, almost twice the Residual variance estimate of 0.042. It can be seen in the ANOVA table that the Blk(Loc) MS is the error term for the Loc effect and the Residual MS is the error term for the other fixed effects. Because the Blk(Loc) MS is based on fewer df (6 df vs. 54 df) and was significantly larger (0.81 vs. 0.04) than the Residual MS, the test of significance for Loc has less power and a higher probability of a Type 2 error than the tests of the other fixed effects. Regardless, the main effects of Loc and Ca_Trt were highly significant (p < 0.001), while the main effect of Mulch and all interactions were not significant (α = 0.05).

Table 12. SAS Code – PROC MIXED for pH combined over locations.

    title 'Mixed pH ANOVA combined locations';
    proc mixed data=anova.statbean plots=residualpanel method=type3;
      class Loc Blk Mulch Ca_Trt;
      model pH = Loc|Mulch|Ca_Trt;
      random Blk(Loc);
      lsmeans Loc|Mulch|Ca_Trt;
    run;

Fig. 8. Residual panels for pH combined over locations.

Table 13. PROC MIXED output for pH combined over locations using Type 3 estimation method.

Planned and Multiple Pairwise Comparison

Planned comparisons similar to those used for the ANOVA at the Central location can also be used for the combined ANOVA. For the combined ANOVA, the contrasts in Table 11 can be recoded to test the lime and gypsum effects both averaged over locations and/or within locations using the pooled residual error term.
Contrasts testing specific Loc × Ca_Trt interactions can also be conducted. Details regarding contrasts for the combined ANOVA are beyond the scope of this chapter but can be found in Chapter 7 (Vargas et al., 2018).

As previously mentioned, the soil pH and soil Ca concentration means provide information important for statbean production. If statbean is to become a new crop, growers will want to know if the pH and Ca concentration of their soil are suitable for statbean and if it will "pay" to add lime. And they probably want recommendations that are based on scientifically and statistically sound research.

Means should usually be given in tables or graphs to highlight the significant effects, especially significant interactions. For both soil pH and soil Ca concentration, the Loc and Ca_Trt effects were significant but not the Loc × Ca_Trt interaction. Therefore, bar graphs are shown for the Ca_Trt means at each location (Fig. 9). These graphs illustrate the large and significant differences between locations, the smaller yet significant differences between Ca_Trt means within each location, and that the differences among Ca_Trt means are similar at each location. As in Fig. 7, LSD(0.05) bars centered around the means give a confidence interval for the mean and show which means are significantly different from other means. The graphs can also be used to interpret the results of the planned comparisons.

Fig. 9. Bar graph of Ca_Trt means and LSD(0.05) for soil pH and soil Ca concentrations at each of the eastern, western, and central locations. The LSD(0.05) is centered around each mean so means are significantly different if the LSD bars do not overlap. The LSD(0.05) for pH = 0.24; LSD(0.05) for Ca concentration = 341 mg kg⁻¹.

Presentation of ANOVA Results

For research to be published as a graduate thesis or in a quality scientific journal, it must meet rigorous scientific standards, which include demonstrating that the statistical design and the analysis are appropriate and support the research conclusions (Table 1). This entails describing the research objectives, experimental design, dependent and explanatory variables, and statistical tests in sufficient detail for reviewers and readers to independently judge their computational, statistical, and especially biological soundness. The description should also enable other researchers to repeat the experiment.

An ANOVA table can be constructed to encapsulate most of this statistical information. Analysis of variance tables are especially effective for research with complex designs, many dependent variables, multiple factors, multiple error terms, and/or mixed models. In addition, ANOVA tables provide readers a convenient place to find, verify, and interpret ANOVA results as they read through a paper.

An ANOVA table has been proposed that contains ANOVA components chosen to concisely quantify, summarize, and organize the key details of the design, analysis, and results of an experiment (McIntosh, 2015). This ANOVA table format, termed SIMPLE (Simple, Informative, Meaningful, Powerful, Logical, Effective), includes selected ANOVA components for all SOV in the linear model to provide the statistical information listed in Table 14. Just as ANOVA is not "one-size-fits-all", neither is a SIMPLE ANOVA table. Instead, it is a framework for authors to consider for presentation of ANOVA results, which can be compared to other options.

Table 14. Statistical information that can be determined from ANOVA components in a SIMPLE ANOVA.

ANOVA Component               What the reader can directly or indirectly determine
Source of Variation           effects in the linear model
                              explanatory and design effects
                              construction of the effects (additive, nested, or crossed)
                              treatment factors and interactions
Effect Type                   random and fixed model effects
                              scope of inference for an effect (narrow or broad)
                              model type (random, mixed, fixed)
                              experimental design and error terms
Degrees of Freedom            number of treatments or levels of each factor
                              number of replications, samples/replication
                              adequacy of df for estimates of variances of random effects
                              adequacy of sample size for tests of significance
F-value (fixed effects)       relative magnitude of the effect size
Mean Square (random effects)  test significance for additional mean comparisons
                              test for significant differences between error terms
                              standard deviations and standard errors
P > F or *, **, ***           probability used for the test of significance

Table 15. SIMPLE ANOVA for soil pH at three locations.

                          Central           Western           Eastern
Source     Effect  df     F value  P>F      F value  P>F      F value  P>F
Blk        R       2
Ca_Trt     F       4      4.3      **       5.9      **       11.2     **
Mulch      F       1      <1       ns       1.4      ns       <1       ns
Mulch*Trt  F       4      <1       ns       1.2      ns       <1       ns
Error      RE      18     (0.07)†           (0.04)            (0.02)
† MS of random error in parentheses.

Table 16. SIMPLE ANOVA for pH, Ca, and Ptotal.

                                 pH                 Ca                 Ptotal
Source            Effect†  DF    F (MS)‡   P>F      F (MS)    P>F      F (MS)   P>F
Loc               F        2     65.1      0.01     50.2      0.01     3.9      0.08
Blk(Loc)          RE       6     (0.076)            (603646)           (1093)
Mulch             F        1     1.9       0.18     0.4       0.50     3.8      0.06
Loc×Mulch         F        2     0.3       0.74     0.6       0.52     12.1     0.001
Ca_Trt            F        4     16.2      0.01     6.5       0.01     0.9      0.49
Loc×Ca_Trt        F        8     0.5       0.87     1.1       0.34     1.3      0.24
Mulch×Ca_Trt      F        4     1.4       0.24     0.0       1.0      0.3      0.88
Loc×Mulch×Ca_Trt  F        8     0.5       0.86     0.5       0.85     0.4      0.93
Residual          RE       54    (0.042)            (86858)            (65)
† Effect Type: fixed (F) or random error (RE).
‡ F-values for fixed effects, MS values in parentheses for random effects.
Examples of SIMPLE ANOVA tables for soil pH analyzed separately at each location and for the three dependent variables (soil pH, soil Ca, and Ptotal) combined over locations are shown in Tables 15 and 16. Statistics were rounded to show only the informative digits and not exaggerate their precision. Probability values are reported to only two decimal places, or as significance levels no smaller than 0.01 (**). The SIMPLE ANOVA tables differ from more common tables of significance tests of fixed effects because they also give the SOV and df of random effects, including error terms. Thus, an ANOVA table containing SOV for each term in the linear model can facilitate a holistic understanding of the ANOVA that includes attributes about the experimental design and error variation.

The ANOVA process, which included planned comparisons and LSD values to rank and compare means, provided the statistical basis for interpreting the effects of the factors and their interactions on the means of the dependent variables. It has also provided estimates of random error variance components that can be useful in designing future experiments with a desired level of precision.

Conclusions

One goal of this chapter was to reinforce and enhance your practical knowledge of ANOVA. Maybe you have also increased your appreciation for ANOVA as a powerful research tool for enhancing scientific inquiry and defeating the spread of junk (irreproducible) science. An underlying goal of this chapter is to raise awareness of practices for using ANOVA that will encourage rigorous statistical standards for publishing research and require rationale for and description of statistical methods with sufficient detail to both understand and justify their use. The concepts and suggestions in this chapter are meant to support your efforts to ensure that your research meets these standards but do not involve the analysis of complex models and non-normal data.
These require a deeper understanding of statistical concepts beyond those covered here. Hopefully you will find helpful, practical discussions of more complex topics in other chapters in this book. Also, always keep in mind that it is wise to involve a skilled statistician from the start to finish of an experiment.

Finally, if you want to try statbean as a herbal supplement to enhance your statistical skill, it will be available as soon as we can convince the producers that our new statbean cultivar is significantly better than the old cultivar. In our first trial we chose a significance level of 0.05 and the new cultivar was not significantly better than the old cultivar (P > F = 0.0541). Thus, the new cultivar is no better than the old. Do you agree?

Key Learning Points

• Basic concepts and terms of ANOVA.
• ANOVA process to test hypotheses.
• Use and relevance of the separate components of ANOVA.
• How to conduct a mixed model ANOVA using SAS or R.
• How to construct an effective ANOVA table for scientific publication.
• The role of ANOVA to maintain standards and advance science.

Review Questions (T/F)

1. ANOVA is a statistical approach used to discover the factors that contribute to the variation of the dependent variable.
2. A linear model was the mathematical basis of the traditional ANOVA but is not used for the contemporary mixed model.
3. The ratio of Var(Treatment)/Var(Residual) is used to test if a treatment effect is significant. If Var(Treatment) is larger than Var(Residual), then the F-value will be greater than 1 and the treatment effect will be considered significant.
4. The probability of a Type 1 error (incorrectly concluding that treatment means are significantly different) can be decreased by increasing the number of replications.
5. To ensure scientific standards, research papers are subjected to peer review prior to publication.
These papers need to include sufficient detail for peer reviewers and readers to judge the validity and merit of the experimental design and statistical analyses.

Exercises

1. Find an interesting article in a journal in your field of study to "peer review" and answer the following questions.
   a. Do the descriptions of the experimental design and statistical analyses provide adequate detail and clarity? Explain your answer using Table 1 as a guide.
   b. Are the design, analysis, and variables measured appropriate for the experimental objectives? Justify your answer.
   c. Are the results and conclusions substantiated by the statistics provided? Justify your answer.
2. Researcher A published a paper that found that the treatment effect was significant at the 0.05 level. Researcher B had conducted an experiment with the same treatments but concluded that the treatment effect was not significant at the 0.05 level. Not surprisingly, Researcher B did not publish these nonsignificant results.
   a. Give at least two likely reasons for the different results.
   b. Based on one or more of your reasons in 2a, describe a scenario that justifies publishing Researcher A's findings but not Researcher B's findings. Explain your justification, including the expected long-term consequences.
   c. Based on one or more of your reasons in 2a, describe a scenario that justifies publishing both Researcher A's and Researcher B's findings. Explain your justification, including the expected long-term consequences.
3. P-values are quantitative statistics that are often reduced to two categories (significant and nonsignificant) to make inferences about the data. What are the advantages and disadvantages of this common practice?

References

Acutis, M., B. Scaglia, and R. Confalonieri. 2012. Perfunctory analysis of variance in agronomy, and its consequences in experimental results interpretation. Eur. J. Agron. 43:129–135. doi:10.1016/j.eja.2012.06.006
Amrhein, V., F. Korner-Nievergelt, and T. Roth. 2017. The earth is flat (p > 0.05): Significance thresholds and the crisis of unreplicable research. PeerJ 5:e3544.
Boos, D.D., and L.A. Stefanski. 2011. P-value precision and reproducibility. Am. Stat. 65:213–221. doi:10.1198/tas.2011.10129
Box, G.E. 1976. Science and statistics. J. Am. Stat. Assoc. 71:791–799. doi:10.1080/01621459.1976.10480949
Casler, M.D. 2018a. Blocking principles for biological experiments. In: B. Glaz and K.M. Yeater, Applied statistics in the agricultural, biological, and environmental sciences. ASA, CSSA, SSSA, Madison, WI.
Casler, M.D. 2018b. Power and replication: Designing powerful experiments. In: B. Glaz and K.M. Yeater, Applied statistics in the agricultural, biological, and environmental sciences. ASA, CSSA, SSSA, Madison, WI.
Dixon, P.M., K.J. Moore, and E. van Santen. 2018. The analysis of combined experiments. In: B. Glaz and K.M. Yeater, Applied statistics in the agricultural, biological, and environmental sciences. ASA, CSSA, SSSA, Madison, WI.
Fisher, R.A. 1921. Studies in crop variation. I. An examination of the yield of dressed grain from Broadbalk. J. Agric. Sci. 11:107–135. doi:10.1017/S0021859600003750
Fisher, R.A. 1925. Statistical methods for research workers. Oliver and Boyd, Edinburgh, UK.
Fisher, R.A. 1926. Arrangement of field experiments. J. Minist. Agric. (G. B.) 33:503–513.
Fisher, R.A., and W.A. Mackenzie. 1923. Studies in crop variation. II. The manurial response of different potato varieties. J. Agric. Sci. 13:311–320. doi:10.1017/S0021859600003592
Franco, A., N. Malhotra, and G. Simonovits. 2014. Publication bias in the social sciences: Unlocking the file drawer. Science 345:1502–1505. doi:10.1126/science.1255484
Garland-Campbell, K. 2018. Errors in statistical decision-making. In: B. Glaz and K.M. Yeater, Applied statistics in the agricultural, biological, and environmental sciences. ASA, CSSA, SSSA, Madison, WI.
Gbur, E.E., W.W. Stroup, K.S. McCarter, S. Durham, L.J. Young, M. Christman, M. West, and M. Kramer. 2012. Analysis of generalized linear mixed models in the agricultural and natural resources sciences. ASA, CSSA, SSSA, Madison, WI.
Gelman, A., and H. Stern. 2006. The difference between "significant" and "not significant" is not itself statistically significant. Am. Stat. 60:328–331. doi:10.1198/000313006X152649
Kozak, M., and H.P. Piepho. 2017. What's normal anyway? Residual plots are more telling than significance tests when checking ANOVA assumptions. J. Agron. Crop Sci. 2017:1–13. doi:10.1111/jac.12220
Lehmann, E.L. 1993. The Fisher, Neyman-Pearson theories of testing hypotheses: One theory or two? J. Am. Stat. Assoc. 88:1242–1249.
Littell, R.C., G.A. Milliken, W.W. Stroup, R.D. Wolfinger, and O. Schabenberger. 2006. SAS for mixed models. 2nd ed. SAS Institute, Cary, NC.
Loy, A., L. Follett, and H. Hofmann. 2016. Variations of Q–Q plots: The power of our eyes! Am. Stat. 70:202–214. doi:10.1080/00031305.2015.1077728
Majumder, M., H. Hofmann, and D. Cook. 2013. Validation of visual statistical inference, applied to linear models. J. Am. Stat. Assoc. 108:942–956. doi:10.1080/01621459.2013.808157
McIntosh, M.S. 2015. Can analysis of variance be more significant? Agron. J. 107:706–717. doi:10.2134/agronj14.0177
Mervis, J. 2014. Why null results rarely see the light of day. Science 345:992. doi:10.1126/science.345.6200.992
Moore, K.J., and P.M. Dixon. 2015. Analysis of combined experiments revisited. Agron. J. 107:763–771. doi:10.2134/agronj13.0485
Murdoch, D.J., Y.L. Tsai, and J. Adcock. 2008. P-values are random variables. Am. Stat. 62:242–245. doi:10.1198/000313008X332421
Neyman, J., and B. Tokarska. 1936. Errors of the second kind in testing "Student's" hypothesis. J. Am. Stat. Assoc. 31:318–326. doi:10.2307/2278560
Nuzzo, R. 2014. Scientific method: Statistical errors. Nature 506:150–152. doi:10.1038/506150a
Piepho, H.P., A. Büchse, and K. Emrich. 2003. A hitchhiker's guide to mixed models for randomized experiments. J. Agron. Crop Sci. 189:310–322.
Saville, D.J. 2015. Multiple comparison procedures – Cutting the Gordian knot. Agron. J. 107:730–735. doi:10.2134/agronj2012.0394
Saville, D.J. 2018. Multiple comparison procedures: The ins and outs. In: B. Glaz and K.M. Yeater, Applied statistics in the agricultural, biological, and environmental sciences. ASA, CSSA, SSSA, Madison, WI.
Spilke, J., H.P. Piepho, and X. Hu. 2005. Analysis of unbalanced data by mixed linear models using the MIXED procedure of the SAS system. J. Agron. Crop Sci. 191:47–54. doi:10.1111/j.1439-037X.2004.00120.x
Stroup, W. 2018. Analysis of non-Gaussian data. In: B. Glaz and K.M. Yeater, Applied statistics in the agricultural, biological, and environmental sciences. ASA, CSSA, SSSA, Madison, WI.
Vargas, M., B. Glaz, J. Crossa, and A. Morgounov. 2018. Analysis and interpretation of interactions of fixed and random effects. In: B. Glaz and K.M. Yeater, Applied statistics in the agricultural, biological, and environmental sciences. ASA, CSSA, SSSA, Madison, WI.
Wasserstein, R.L., and N.A. Lazar. 2016. The ASA's statement on p-values: Context, process, and purpose. Am. Stat. 70:129–133. doi:10.1080/00031305.2016.1154108
Yang, R.C. 2010. Towards understanding and use of mixed-model analysis of agricultural experiments. Can. J. Plant Sci. 90:605–627. doi:10.4141/CJPS10049
Yeater, K.M., and M.B. Villamil. 2018. Multivariate methods for agricultural research. In: B. Glaz and K.M. Yeater, Applied statistics in the agricultural, biological, and environmental sciences. ASA, CSSA, SSSA, Madison, WI.