
Biostatistics

Biostatistics (also known as biometry) is the development and application of statistical methods to a wide range of topics in biology. It encompasses the design of biological experiments, the collection and analysis of data from those experiments, and the interpretation of the results.

History

Biostatistics and genetics

Biostatistical modeling forms an important part of numerous modern biological theories. From their beginning, genetics studies have used statistical concepts to understand observed experimental results, and some geneticists have even contributed statistical advances by developing new methods and tools. Gregor Mendel started genetics research by investigating segregation patterns in families of peas and used statistics to explain the collected data. In the early 1900s, after the rediscovery of Mendel's work on Mendelian inheritance, there were gaps in understanding between genetics and evolutionary Darwinism. Francis Galton tried to expand Mendel's discoveries with human data and proposed a different model, in which fractions of the heredity come from each ancestor, composing an infinite series. He called this the "Law of Ancestral Heredity". His ideas were strongly opposed by William Bateson, who followed Mendel's conclusion that genetic inheritance comes exclusively from the parents, half from each of them. This led to a vigorous debate between the biometricians, who supported Galton's ideas, such as Raphael Weldon, Arthur Dukinfield Darbishire and Karl Pearson, and the Mendelians, who supported Bateson's (and Mendel's) ideas, such as Charles Davenport and Wilhelm Johannsen. Later, biometricians could not reproduce Galton's conclusions in different experiments, and Mendel's ideas prevailed. By the 1930s, models built on statistical reasoning had helped to resolve these differences and to produce the neo-Darwinian modern evolutionary synthesis.

Resolving these differences also made it possible to define the concept of population genetics and brought together genetics and evolution. The three leading figures in the establishment of population genetics and this synthesis all relied on statistics and developed its use in biology.

These and other biostatisticians, mathematical biologists, and statistically inclined geneticists helped bring together evolutionary biology and genetics into a consistent, coherent whole that could begin to be quantitatively modeled.

In parallel to this overall development, the pioneering work of D'Arcy Thompson in On Growth and Form also helped to add quantitative discipline to biological study.

Despite the fundamental importance and frequent necessity of statistical reasoning, there may nonetheless have been a tendency among biologists to distrust or deprecate results which are not qualitatively apparent. One anecdote describes Thomas Hunt Morgan banning the Friden calculator from his department at Caltech, saying "Well, I am like a guy who is prospecting for gold along the banks of the Sacramento River in 1849. With a little intelligence, I can reach down and pick up big nuggets of gold. And as long as I can do that, I'm not going to let any people in my department waste scarce resources in placer mining."[3]

Research planning

Any research in the life sciences is proposed to answer a scientific question. To answer this question with high certainty, we need accurate results. The correct definition of the main hypothesis and the research plan will reduce errors in decision-making when trying to understand a phenomenon. The research plan might include the research question, the hypothesis to be tested, the experimental design, the data collection methods, the data analysis perspectives, and the costs involved. It is essential to conduct the study based on the three basic principles of experimental statistics: randomization, replication, and local control.

Research question

The research question will define the objective of a study. The research will be guided by the question, so it needs to be concise and, at the same time, focused on interesting and novel topics that may improve science and knowledge in the field. To define the way to ask the scientific question, an exhaustive literature review might be necessary, so that the research can add value to the scientific community.[4]

Hypothesis definition

Once the aim of the study is defined, the possible answers to the research question can be proposed, transforming this question into a hypothesis. The main proposal is called the null hypothesis (H0) and is usually based on permanent knowledge about the topic or an obvious occurrence of the phenomenon, sustained by a deep literature review. We can say it is the standard expected answer for the data under the situation being tested. In general, H0 assumes no association between treatments. On the other hand, the alternative hypothesis is the denial of H0. It assumes some degree of association between the treatment and the outcome. In both cases, the hypothesis is sustained by the research question and its expected and unexpected answers.[4]

As an example, consider groups of similar animals (mice, for example) under two different diet systems. The research question would be: what is the best diet? In this case, H0 would be that there is no difference between the two diets in mouse metabolism (H0: μ1 = μ2) and the alternative hypothesis would be that the diets have different effects on the animals' metabolism (H1: μ1 ≠ μ2).
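As a concrete illustration, a hypothesis like this can be tested with a two-sample test. The article does not prescribe a particular test; the sketch below uses a permutation test (one of several valid choices) on hypothetical weight-gain data to approximate the p-value for H0: μ1 = μ2.

```python
import random

def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for H0: mu1 == mu2.

    Returns an approximate p-value: the fraction of random relabelings
    whose absolute mean difference is at least as extreme as observed.
    """
    rng = random.Random(seed)
    observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    count = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        perm_a, perm_b = pooled[:n_a], pooled[n_a:]
        diff = abs(sum(perm_a) / n_a - sum(perm_b) / len(perm_b))
        if diff >= observed:
            count += 1
    return count / n_permutations

# Hypothetical weight-gain data (grams) for mice under diet 1 and diet 2
diet1 = [10.2, 11.1, 9.8, 10.5, 11.4, 10.9]
diet2 = [12.3, 13.0, 12.1, 12.8, 13.4, 12.6]
p = permutation_test(diet1, diet2)
```

A small p (below the predefined α) leads to rejecting H0, i.e. concluding the diets differ.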

The hypothesis is defined by the researcher, according to his or her interest in answering the main question. Besides that, there can be more than one alternative hypothesis. It can assume not only differences across observed parameters, but also their degree of difference (i.e. higher or lower).

Sampling

Usually, a study aims to understand the effect of a phenomenon on a population. In biology, a population is defined as all the individuals of a given species, in a specific area at a given time. In biostatistics, this concept is extended to a variety of possible collections of study: a population is not only the individuals, but the total of one specific component of their organisms, such as the whole genome, or all the sperm cells for animals, or the total leaf area for a plant, for example.

It is usually not possible to take measurements from all the elements of a population. Because of that, the sampling process is very important for statistical inference. Sampling is defined as randomly obtaining a representative part of the entire population, in order to make posterior inferences about the population; the sample should capture as much of the variability across the population as possible.[5] The sample size is determined by several things, ranging from the scope of the research to the resources available. In clinical research, the trial type (inferiority, equivalence, or superiority) is key in determining sample size.[4]
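A minimal sketch of simple random sampling without replacement (the data here are hypothetical leaf areas):

```python
import random

def simple_random_sample(population, n, seed=42):
    """Draw a simple random sample of size n without replacement."""
    rng = random.Random(seed)
    return rng.sample(population, n)

# Hypothetical population: leaf areas (cm^2) of 1,000 plants
population = [round(50 + 10 * random.Random(i).random(), 2) for i in range(1000)]
sample = simple_random_sample(population, 30)
```

Inferences (means, standard errors, and so on) are then computed on the sample and extended to the population.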

Experimental design

Experimental designs sustain those basic principles of experimental statistics. There are three basic experimental designs for randomly allocating treatments across all plots of the experiment: the completely randomized design, the randomized block design, and factorial designs. Treatments can be arranged in many ways inside the experiment. In agriculture, the correct experimental design is the root of a good study, and the arrangement of treatments within the study is essential because the environment largely affects the plots (plants, livestock, microorganisms). These main arrangements can be found in the literature under the names of "lattices", "incomplete blocks", "split plot", "augmented blocks", and many others. All of the designs might include control plots, determined by the researcher, to provide an error estimation during inference.
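The completely randomized design, the simplest of the three, amounts to a random shuffle of treatment labels over plots. A sketch (treatment labels here are hypothetical):

```python
import random

def completely_randomized_design(treatments, replicates, seed=1):
    """Randomly allocate each treatment to `replicates` plots.

    Returns a list mapping plot index -> treatment label.
    """
    rng = random.Random(seed)
    plots = [t for t in treatments for _ in range(replicates)]
    rng.shuffle(plots)
    return plots

# Three treatments plus a control, four replicates each -> 16 plots
layout = completely_randomized_design(["A", "B", "C", "control"], replicates=4)
```

Randomized block and factorial designs add structure on top of this idea (shuffling within blocks, or crossing treatment factors).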

In clinical studies, the samples are usually smaller than in other biological studies, and in most cases, the environment effect can be controlled or measured. It is common to use randomized controlled clinical trials, where results are usually compared with observational study designs such as case–control or cohort.[6]

Data collection

Data collection methods must be considered in research planning, because they highly influence the sample size and experimental design.

Data collection varies according to the type of data. For qualitative data, collection can be done with structured questionnaires or by observation, considering the presence or intensity of disease and using a scoring criterion to categorize levels of occurrence.[7] For quantitative data, collection is done by measuring numerical information using instruments.

In agriculture and biology studies, yield data and its components can be obtained by metric measures. However, pest and disease injuries in plants are obtained by observation, using score scales for levels of damage. In genetic studies especially, modern methods for data collection in the field and laboratory should be considered, such as high-throughput platforms for phenotyping and genotyping. These tools allow bigger experiments, making it possible to evaluate many plots in less time than a purely human-based method of data collection would. Finally, all collected data of interest must be stored in an organized data frame for further analysis.

Analysis and data interpretation

Descriptive tools

Data can be represented through tables or graphical representations, such as line charts, bar charts, histograms, and scatter plots. Also, measures of central tendency and variability can be very useful to describe an overview of the data. Some examples follow:

Frequency tables

One type of table is the frequency table, which consists of data arranged in rows and columns, where the frequency is the number of occurrences or repetitions of the data. Frequency can be:[8]

Absolute: represents the number of times that a given value appears;

Relative: obtained by dividing the absolute frequency by the total number of observations;

In the next example, we have the number of genes in ten operons of the same organism.

Genes = {2, 3, 3, 4, 5, 3, 3, 3, 3, 4}

Genes number   Absolute frequency   Relative frequency
1              0                    0
2              1                    0.1
3              6                    0.6
4              2                    0.2
5              1                    0.1
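The frequency table above can be reproduced programmatically; a minimal sketch in Python:

```python
from collections import Counter

# Number of genes in ten operons, from the example above
genes = [2, 3, 3, 4, 5, 3, 3, 3, 3, 4]

absolute = Counter(genes)                 # value -> absolute frequency
total = len(genes)
relative = {value: count / total for value, count in absolute.items()}
```

For instance, the value 3 occurs 6 times out of 10 observations, giving an absolute frequency of 6 and a relative frequency of 0.6.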

Line graph

 
Figure A: Line graph example. The birth rate in Brazil (2010–2016);[9] Figure B: Bar chart example. The birth rate in Brazil for the December months from 2010 to 2016; Figure C: Example of Box Plot: number of glycines in the proteome of eight different organisms (A-H); Figure D: Example of a scatter plot.

Line graphs represent the variation of a value over another metric, such as time. In general, values are represented in the vertical axis, while the time variation is represented in the horizontal axis.[10]

Bar chart

A bar chart is a graph that shows categorical data as bars whose heights (vertical bars) or widths (horizontal bars) are proportional to the values they represent. Bar charts provide an image that could also be represented in a tabular format.[10]

In the bar chart example, we have the birth rate in Brazil for the December months from 2010 to 2016.[9] The sharp fall in December 2016 reflects the effect of the Zika virus outbreak on the birth rate in Brazil.

Histograms

 
Example of a histogram.

The histogram (or frequency distribution) is a graphical representation of a dataset tabulated and divided into uniform or non-uniform classes. It was first introduced by Karl Pearson.[11]

Scatter plot

A scatter plot is a mathematical diagram that uses Cartesian coordinates to display values of a dataset. A scatter plot shows the data as a set of points, each of which has the value of one variable determining its position on the horizontal axis and the value of another variable determining its position on the vertical axis.[12] They are also called scatter graph, scatter chart, scattergram, or scatter diagram.[13]

Mean

The arithmetic mean is the sum of a collection of values (x1, x2, ..., xn) divided by the number of items in this collection (n):

x̄ = (x1 + x2 + ... + xn) / n

Median

The median is the value in the middle of a dataset.

Mode

The mode is the value of a set of data that appears most often.[14]

Comparison among mean, median and mode
Values = {2, 3, 3, 3, 3, 3, 4, 4, 11}

Type     Example                                     Result
Mean     (2 + 3 + 3 + 3 + 3 + 3 + 4 + 4 + 11) / 9   4
Median   2, 3, 3, 3, 3, 3, 4, 4, 11                  3
Mode     2, 3, 3, 3, 3, 3, 4, 4, 11                  3
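The comparison above can be checked with Python's standard statistics module:

```python
import statistics

values = [2, 3, 3, 3, 3, 3, 4, 4, 11]  # the comparison dataset from the table above

mean = statistics.mean(values)      # (2+3+3+3+3+3+4+4+11) / 9 = 4
median = statistics.median(values)  # middle value of the sorted list = 3
mode = statistics.mode(values)      # most frequent value = 3
```

Note how the single large value 11 pulls the mean above the median and mode, illustrating the mean's sensitivity to outliers.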

Box plot

A box plot is a method for graphically depicting groups of numerical data. The maximum and minimum values are represented by the lines (whiskers), and the interquartile range (IQR) represents 25–75% of the data. Outliers may be plotted as circles.

Correlation coefficients

Although correlations between two different kinds of data can be suggested by graphs, such as a scatter plot, it is necessary to validate this through numerical information. For this reason, correlation coefficients are required. They provide a numerical value that reflects the strength of an association.[10]

Pearson correlation coefficient

 
Scatter diagram that demonstrates the Pearson correlation for different values of ρ.

The Pearson correlation coefficient is a measure of association between two variables, X and Y. This coefficient, usually represented by ρ (rho) for the population and r for the sample, assumes values between −1 and 1, where ρ = 1 represents a perfect positive correlation, ρ = −1 represents a perfect negative correlation, and ρ = 0 indicates no linear correlation.[10]
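A minimal sketch of computing the sample coefficient r from its definition (the covariance of X and Y divided by the product of their standard deviations):

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation coefficient between paired values x and y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]  # perfectly linear in x, so r = 1
```

Reversing y (e.g. [10, 8, 6, 4, 2]) gives r = −1, a perfect negative correlation.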

Inferential statistics

Inferential statistics is used to make inferences[15] about an unknown population, by estimation and/or hypothesis testing. In other words, it is desirable to obtain parameters to describe the population of interest, but since the data is limited, it is necessary to make use of a representative sample in order to estimate them. With that, it is possible to test previously defined hypotheses and apply the conclusions to the entire population. The standard error of the mean is a measure of variability that is crucial for making inferences.[5]

Hypothesis testing is essential for making inferences about populations, aiming to answer research questions, as settled in the "Research planning" section. Authors have defined four steps to be set:[5]

  1. The hypothesis to be tested: as stated earlier, we have to work with the definition of a null hypothesis (H0), which is going to be tested, and an alternative hypothesis. But they must be defined before the experiment is implemented.
  2. Significance level and decision rule: a decision rule depends on the level of significance, or, in other words, the acceptable error rate (α). It is easier to think of it as defining a critical value that determines statistical significance when a test statistic is compared with it. So, α also has to be predefined before the experiment.
  3. Experiment and statistical analysis: this is when the experiment is really implemented following the appropriate experimental design, data is collected, and the most suitable statistical tests are evaluated.
  4. Inference: this is made when the null hypothesis is rejected or not rejected, based on the evidence brought by the comparison of p-values and α. It must be pointed out that failing to reject H0 just means that there is not enough evidence to support its rejection, not that this hypothesis is true.

A confidence interval is a range of values that can contain the true parameter value at a given level of confidence. The first step is to compute the best unbiased estimate of the population parameter. The upper value of the interval is obtained by adding to this estimate the product of the standard error of the mean and the critical value corresponding to the confidence level. The calculation of the lower value is similar, but a subtraction must be applied instead of a sum.[5]
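A sketch of this calculation for the population mean, assuming a normal approximation (critical value 1.96 for 95% confidence) and hypothetical sample values:

```python
import math
import statistics

def confidence_interval(sample, z=1.96):
    """Approximate CI for the mean: estimate ± z * standard error.

    z = 1.96 corresponds to 95% confidence under a normal approximation.
    """
    mean = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(len(sample))  # standard error of the mean
    return mean - z * se, mean + z * se

# Hypothetical measurements
sample = [4.8, 5.1, 5.0, 4.9, 5.2, 5.0, 4.7, 5.3]
low, high = confidence_interval(sample)
```

For small samples, a t-distribution critical value (which depends on the degrees of freedom) would replace 1.96.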

Statistical considerations

Power and statistical error

When testing a hypothesis, there are two possible types of statistical error: Type I error and Type II error. The type I error, or false positive, is the incorrect rejection of a true null hypothesis, and the type II error, or false negative, is the failure to reject a false null hypothesis. The significance level, denoted by α, is the type I error rate and should be chosen before performing the test. The type II error rate is denoted by β, and the statistical power of the test is 1 − β.

p-value

The p-value is the probability of obtaining results as extreme as or more extreme than those observed, assuming the null hypothesis (H0) is true. It is also called the calculated probability. It is common to confuse the p-value with the significance level (α), but α is a predefined threshold for declaring results significant. If p is less than α, the null hypothesis (H0) is rejected.[16]

Multiple testing

In multiple tests of the same hypothesis, the probability of false positives (the familywise error rate) increases, and strategies are used to control this. This is commonly achieved by using a more stringent threshold to reject null hypotheses. The Bonferroni correction defines an acceptable global significance level, denoted by α*, and each test is individually compared with α = α*/m. This ensures that the familywise error rate across all m tests is less than or equal to α*. When m is large, the Bonferroni correction may be overly conservative. An alternative to the Bonferroni correction is to control the false discovery rate (FDR). The FDR controls the expected proportion of the rejected null hypotheses (the so-called discoveries) that are false (incorrect rejections). This procedure ensures that, for independent tests, the false discovery rate is at most q*. Thus, the FDR is less conservative than the Bonferroni correction and has more power, at the cost of more false positives.[17]

Mis-specification and robustness checks

The main hypothesis being tested (e.g., no association between treatments and outcomes) is often accompanied by other technical assumptions (e.g., about the form of the probability distribution of the outcomes) that are also part of the null hypothesis. When the technical assumptions are violated in practice, the null may be frequently rejected even if the main hypothesis is true. Such rejections are said to be due to model mis-specification.[18] Verifying that the outcome of a statistical test does not change when the technical assumptions are slightly altered (so-called robustness checks) is the main way of combating mis-specification.

Model selection criteria

Model selection criteria select the model that best approximates the true model. Akaike's information criterion (AIC) and the Bayesian information criterion (BIC) are examples of asymptotically efficient criteria.
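Both criteria are simple functions of the maximized log-likelihood ln L, the number of parameters k, and (for BIC) the sample size n; lower values indicate a better trade-off between fit and complexity. A sketch with hypothetical fitted values:

```python
import math

def aic(log_likelihood, k):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood, k, n):
    """Bayesian information criterion: k ln n - 2 ln L (lower is better)."""
    return k * math.log(n) - 2 * log_likelihood

# Hypothetical fits: model B adds one parameter for a small likelihood gain
n = 100
aic_a, bic_a = aic(-120.0, k=3), bic(-120.0, k=3, n=n)
aic_b, bic_b = aic(-119.5, k=4), bic(-119.5, k=4, n=n)
```

Here both criteria prefer the simpler model A, since the extra parameter of model B does not improve the likelihood enough; BIC penalizes the extra parameter more heavily as n grows.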

Developments and big data

Recent developments have made a large impact on biostatistics. Two important changes have been the ability to collect data on a high-throughput scale, and the ability to perform much more complex analyses using computational techniques. This comes from developments in areas such as sequencing technologies, bioinformatics, and machine learning (machine learning in bioinformatics).

Use in high-throughput data

New biomedical technologies like microarrays, next-generation sequencers (for genomics) and mass spectrometry (for proteomics) generate enormous amounts of data, allowing many tests to be performed simultaneously.[19] Careful analysis with biostatistical methods is required to separate the signal from the noise. For example, a microarray could be used to measure many thousands of genes simultaneously, determining which of them have different expression in diseased cells compared to normal cells. However, only a fraction of genes will be differentially expressed.[20]

Multicollinearity often occurs in high-throughput biostatistical settings. Due to high intercorrelation between the predictors (such as gene expression levels), the information of one predictor might be contained in another one. It could be that only 5% of the predictors are responsible for 90% of the variability of the response. In such a case, one could apply the biostatistical technique of dimension reduction (for example via principal component analysis). Classical statistical techniques like linear or logistic regression and linear discriminant analysis do not work well for high dimensional data (i.e. when the number of observations n is smaller than the number of features or predictors p: n < p). As a matter of fact, one can get quite high R2-values despite very low predictive power of the statistical model. These classical statistical techniques (esp. least squares linear regression) were developed for low dimensional data (i.e. where the number of observations n is much larger than the number of predictors p: n >> p). In cases of high dimensionality, one should always consider an independent validation test set and the corresponding residual sum of squares (RSS) and R2 of the validation test set, not those of the training set.
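A minimal sketch of dimension reduction via principal component analysis, using simulated data in which ten correlated "expression" features are driven by two latent factors (NumPy assumed available):

```python
import numpy as np

def pca(X, n_components=2):
    """PCA via eigendecomposition of the feature covariance matrix.

    Returns the data projected onto the top n_components directions.
    """
    Xc = X - X.mean(axis=0)                   # center each feature
    cov = np.cov(Xc, rowvar=False)            # feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
    top = eigvecs[:, ::-1][:, :n_components]  # top components first
    return Xc @ top

rng = np.random.default_rng(0)
# 50 samples, 10 correlated features driven by 2 latent factors plus small noise
latent = rng.normal(size=(50, 2))
X = latent @ rng.normal(size=(2, 10)) + 0.01 * rng.normal(size=(50, 10))
scores = pca(X, n_components=2)  # 50 samples reduced to 2 dimensions
```

Because only two latent factors generate the ten features, the first two principal components capture nearly all of the variability, mirroring the 5%-of-predictors scenario described above.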

Often, it is useful to pool information from multiple predictors together. For example, Gene Set Enrichment Analysis (GSEA) considers the perturbation of whole (functionally related) gene sets rather than of single genes.[21] These gene sets might be known biochemical pathways or otherwise functionally related genes. The advantage of this approach is that it is more robust: It is more likely that a single gene is found to be falsely perturbed than it is that a whole pathway is falsely perturbed. Furthermore, one can integrate the accumulated knowledge about biochemical pathways (like the JAK-STAT signaling pathway) using this approach.

Bioinformatics advances in databases, data mining, and biological interpretation

The development of biological databases enables storage and management of biological data, with the possibility of ensuring access for users around the world. They are useful for researchers depositing data, retrieving information and files (raw or processed) originating from other experiments, or indexing scientific articles, as in PubMed. Another possibility is to search for a desired term (a gene, a protein, a disease, an organism, and so on) and check all results related to this search. There are databases dedicated to SNPs (dbSNP), to the knowledge on gene characterization and pathways (KEGG), and to the description of gene function, classified by cellular component, molecular function and biological process (Gene Ontology).[22] In addition to databases that contain specific molecular information, there are others that are ample in the sense that they store information about an organism or group of organisms. An example of a database directed towards just one organism, but that contains much data about it, is the Arabidopsis thaliana genetic and molecular database – TAIR.[23] Phytozome,[24] in turn, stores the assemblies and annotation files of dozens of plant genomes, also containing visualization and analysis tools. Moreover, there is an interconnection between some databases for information exchange/sharing, and a major initiative was the International Nucleotide Sequence Database Collaboration (INSDC),[25] which relates data from DDBJ,[26] EMBL-EBI,[27] and NCBI.[28]

Nowadays, the increase in size and complexity of molecular datasets has led to the use of powerful statistical methods provided by computer-science algorithms developed in the machine learning field. Therefore, data mining and machine learning allow detection of patterns in data with complex structure, such as biological data, by using methods of supervised and unsupervised learning, regression, cluster detection, and association rule mining, among others.[22] To indicate some of them: self-organizing maps and k-means are examples of clustering algorithms; neural network implementations and support vector machine models are examples of common machine learning algorithms.
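As an illustration of the clustering methods mentioned, here is a minimal k-means implementation applied to hypothetical two-dimensional expression profiles:

```python
import random

def kmeans(points, k, n_iter=100, seed=0):
    """Minimal k-means: alternate assignment and centroid update steps."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # initialize from the data
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                       # assign each point to nearest centroid
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        centroids = [                          # recompute centroid of each cluster
            [sum(dim) / len(c) for dim in zip(*c)] if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Two hypothetical, well-separated groups of 2-D expression profiles
points = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.15),
          (5.0, 5.1), (5.2, 4.9), (5.1, 5.0)]
centroids, clusters = kmeans(points, k=2)
```

With well-separated groups like these, the algorithm recovers the two underlying clusters regardless of the random initialization.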

Collaborative work among molecular biologists, bioinformaticians, statisticians and computer scientists is important to perform an experiment correctly, going from planning, passing through data generation and analysis, and ending with biological interpretation of the results.[22]

Use of computationally intensive methods

On the other hand, the advent of modern computer technology and relatively cheap computing resources have enabled computer-intensive biostatistical methods like bootstrapping and re-sampling methods.
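The idea behind the bootstrap can be sketched briefly: resample the data with replacement many times, recompute the statistic of interest on each resample, and use the spread of the replicates as its standard error (the sample values below are hypothetical):

```python
import random
import statistics

def bootstrap_se(sample, n_boot=2000, stat=statistics.mean, seed=0):
    """Bootstrap standard error of a statistic.

    Resamples with replacement, recomputes the statistic on each resample,
    and returns the standard deviation of the replicates.
    """
    rng = random.Random(seed)
    n = len(sample)
    replicates = [stat([rng.choice(sample) for _ in range(n)]) for _ in range(n_boot)]
    return statistics.stdev(replicates)

# Hypothetical measurements
sample = [5.1, 4.9, 5.3, 5.0, 4.8, 5.2, 5.1, 4.7, 5.4, 5.0]
se = bootstrap_se(sample)
```

The same resampling scheme extends to statistics with no simple analytic standard error (medians, correlation coefficients, model parameters), which is why it pairs naturally with cheap computing.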

In recent times, random forests have gained popularity as a method for performing statistical classification. Random forest techniques generate a panel of decision trees. Decision trees have the advantage that they can be drawn and interpreted (even with a basic understanding of mathematics and statistics). Random forests have thus been used for clinical decision support systems.[citation needed]

Applications

Public health

Public health applications include epidemiology, health services research, nutrition, environmental health, and health care policy and management. In these medical contexts, it is important to consider the design and analysis of clinical trials. One example is the assessment of the severity state of a patient with a prognosis of a disease outcome.

With new technologies and knowledge of genetics, biostatistics is now also used for systems medicine, which consists of a more personalized medicine. For this, an integration of data from different sources is made, including conventional patient data, clinico-pathological parameters, molecular and genetic data, as well as data generated by additional new-omics technologies.[29]

Quantitative genetics

Quantitative genetics draws on population genetics and statistical genetics in order to link variation in genotype with variation in phenotype. In other words, it is desirable to discover the genetic basis of a measurable trait, a quantitative trait, that is under polygenic control. A genome region responsible for a continuous trait is called a quantitative trait locus (QTL). The study of QTLs became feasible by using molecular markers and measuring traits in populations, but their mapping requires obtaining a population from an experimental cross, such as an F2 or recombinant inbred strains/lines (RILs). To scan for QTL regions in a genome, a gene map based on linkage has to be built. Some of the best-known QTL mapping algorithms are Interval Mapping, Composite Interval Mapping, and Multiple Interval Mapping.[30]

However, QTL mapping resolution is impaired by the amount of recombination assayed, a problem for species in which it is difficult to obtain large offspring. Furthermore, allele diversity is restricted to individuals originating from contrasting parents, which limits studies of allele diversity when we have a panel of individuals representing a natural population.[31] For this reason, the genome-wide association study (GWAS) was proposed in order to identify QTLs based on linkage disequilibrium, that is, the non-random association between traits and molecular markers. It was leveraged by the development of high-throughput SNP genotyping.[32]

In animal and plant breeding, the use of markers in selection for breeding, mainly molecular markers, contributed to the development of marker-assisted selection. While QTL mapping is limited by resolution, GWAS does not have enough power for rare variants of small effect that are also influenced by the environment. So, the concept of genomic selection (GS) arose in order to use all molecular markers in selection and to allow the prediction of the performance of candidates in this selection. The proposal is to genotype and phenotype a training population, then develop a model that can obtain the genomic estimated breeding values (GEBVs) of individuals belonging to a population that has been genotyped but not phenotyped, called the testing population.[33] This kind of study can also include a validation population, following the concept of cross-validation, in which the real phenotype results measured in this population are compared with the phenotype results based on the prediction; this is used to check the accuracy of the model.


Expression data

Studies of differential gene expression from RNA-Seq data, as with RT-qPCR and microarrays, demand comparison of conditions. The goal is to identify genes which have a significant change in abundance between different conditions. Experiments are then designed appropriately, with replicates for each condition/treatment, and with randomization and blocking when necessary. In RNA-Seq, the quantification of expression uses the information of mapped reads that are summarized at some genetic unit, such as exons that are part of a gene sequence. While microarray results can be approximated by a normal distribution, RNA-Seq count data are better explained by other distributions. The first distribution used was the Poisson, but it underestimates the sample error, leading to false positives. Currently, biological variation is considered by methods that estimate a dispersion parameter of a negative binomial distribution. Generalized linear models are used to perform the tests for statistical significance, and as the number of genes is high, multiple-testing correction has to be considered.[34] Other examples of analysis on genomics data come from microarray or proteomics experiments,[35][36] often concerning diseases or disease stages.[37]

Other studies

Tools

There are many tools that can be used to perform statistical analysis of biological data. Most of them are also useful in other areas of knowledge, covering a large number of applications (listed alphabetically). Here are brief descriptions of some of them:

  • ASReml: A software package developed by VSNi[40] that can also be used in the R environment as a package. It is developed to estimate variance components under a general linear mixed model using restricted maximum likelihood (REML). Models with fixed and random effects, nested or crossed, are allowed. It also gives the possibility to investigate different variance-covariance matrix structures.
  • CycDesigN:[41] A computer package developed by VSNi[40] that helps researchers create experimental designs and analyze data coming from a design present in one of three classes handled by CycDesigN. These classes are resolvable, non-resolvable, partially replicated, and crossover designs. It also includes some less-used designs, such as Latinized ones, e.g., the t-Latinized design.[42]
  • Orange: A programming interface for high-level data processing, data mining, and data visualization. It includes tools for gene expression and genomics.[22]
  • R: An open source environment and programming language dedicated to statistical computing and graphics. It is an implementation of S language maintained by CRAN.[43] In addition to its functions to read data tables, take descriptive statistics, develop and evaluate models, its repository contains packages developed by researchers around the world. This allows the development of functions written to deal with the statistical analysis of data that comes from specific applications.[44] In the case of Bioinformatics, for example, there are packages located in the main repository (CRAN) and in others, as Bioconductor. It is also possible to use packages under development that are shared in hosting-services as GitHub.
  • SAS: A data analysis software widely used, going through universities, services and industry. Developed by a company with the same name (SAS Institute), it uses SAS language for programming.
  • PLA 3.0:[45] A biostatistical analysis software package for regulated environments (e.g. drug testing) which supports quantitative response assays (parallel-line, parallel-logistics, slope-ratio) and dichotomous assays (quantal response, binary assays). It also supports weighting methods for combination calculations and the automatic data aggregation of independent assay data.
  • Weka: A Java software package for machine learning and data mining, including tools and methods for visualization, clustering, regression, association rules, and classification. There are tools for cross-validation and bootstrapping, and a module for algorithm comparison. Weka can also be called from other programming languages such as Perl or R.[22]
  • Python: A general-purpose programming language widely used for image analysis, deep learning, and machine learning
  • SQL databases
  • NoSQL databases
  • NumPy: numerical computing in Python
  • SciPy
  • SageMath
  • LAPACK: linear algebra routines
  • MATLAB
  • Apache Hadoop
  • Apache Spark
  • Amazon Web Services
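As a minimal, hypothetical sketch of the kind of descriptive computation the numerical tools listed above support, the following uses only Python's standard-library `statistics` module on made-up measurements for two groups; packages such as NumPy and SciPy offer richer, vectorized equivalents of these summaries:

```python
# Sketch: descriptive statistics for two hypothetical groups of
# measurements (the values here are illustrative, not real data).
import statistics

control = [2.1, 2.4, 2.2, 2.6, 2.3]
treated = [3.0, 2.9, 3.3, 3.1, 2.8]

for name, values in [("control", control), ("treated", treated)]:
    # mean, median, and sample standard deviation of each group
    print(f"{name}: mean={statistics.mean(values):.2f} "
          f"median={statistics.median(values):.2f} "
          f"stdev={statistics.stdev(values):.2f}")
```

In practice these summaries are the starting point for the inferential methods the article describes (e.g., comparing the two group means with a hypothesis test).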

Scope and training programs

Almost all educational programmes in biostatistics are at the postgraduate level. They are most often found in schools of public health, affiliated with schools of medicine, forestry, or agriculture, or as a focus of application in departments of statistics.

In the United States, where several universities have dedicated biostatistics departments, many other top-tier universities integrate biostatistics faculty into statistics or other departments, such as epidemiology. Thus, departments carrying the name "biostatistics" may exist under quite different structures. For instance, relatively new biostatistics departments have been founded with a focus on bioinformatics and computational biology, whereas older departments, typically affiliated with schools of public health, have more traditional lines of research involving epidemiological studies and clinical trials as well as bioinformatics. In larger universities around the world, where both a statistics and a biostatistics department exist, the degree of integration between the two departments may range from the bare minimum to very close collaboration. In general, the difference between a statistics program and a biostatistics program is twofold: (i) statistics departments will often host theoretical/methodological research that is less common in biostatistics programs, and (ii) statistics departments have lines of research that may include biomedical applications but also other areas such as industry (quality control), business and economics, and biological areas other than medicine.

Specialized journals

  • Biostatistics[46]
  • International Journal of Biostatistics[47]
  • Journal of Epidemiology and Biostatistics[48]
  • Epidemiology, Biostatistics and Public Health[49]
  • Biometrics[50]
  • Biometrika[51]
  • Biometrical Journal[52]
  • Communications in Biometry and Crop Science[53]
  • Statistical Applications in Genetics and Molecular Biology[54]
  • Statistical Methods in Medical Research[55]
  • Pharmaceutical Statistics[56]
  • Statistics in Medicine[57]

See also

References

  1. ^ Centre for Transformative Innovation, Swinburne University of Technology. "Allan, Frances Elizabeth (Betty) - Person - Encyclopedia of Australian Science and Innovation". www.eoas.info. Retrieved 2022-10-26.
  2. ^ Gunter, Chris (10 December 2008). "Quantitative Genetics". Nature. 456 (7223): 719. Bibcode:2008Natur.456..719G. doi:10.1038/456719a. PMID 19079046.
  3. ^ Charles T. Munger (2003-10-03). "Academic Economics: Strengths and Faults After Considering Interdisciplinary Needs" (PDF). Archived (PDF) from the original on 2022-10-09.
  4. ^ a b c Nizamuddin, Sarah L.; Nizamuddin, Junaid; Mueller, Ariel; Ramakrishna, Harish; Shahul, Sajid S. (October 2017). "Developing a Hypothesis and Statistical Planning". Journal of Cardiothoracic and Vascular Anesthesia. 31 (5): 1878–1882. doi:10.1053/j.jvca.2017.04.020. PMID 28778775.
  5. ^ a b c d Overholser, Brian R; Sowinski, Kevin M (2007). "Biostatistics Primer: Part I". Nutrition in Clinical Practice. 22 (6): 629–35. doi:10.1177/0115426507022006629. PMID 18042950.
  6. ^ Szczech, Lynda Anne; Coladonato, Joseph A.; Owen, William F. (4 October 2002). "Key Concepts in Biostatistics: Using Statistics to Answer the Question "Is There a Difference?"". Seminars in Dialysis. 15 (5): 347–351. doi:10.1046/j.1525-139X.2002.00085.x. PMID 12358639. S2CID 30875225.
  7. ^ Sandelowski, Margarete (2000). "Combining Qualitative and Quantitative Sampling, Data Collection, and Analysis Techniques in Mixed-Method Studies". Research in Nursing & Health. 23 (3): 246–255. CiteSeerX 10.1.1.472.7825. doi:10.1002/1098-240X(200006)23:3<246::AID-NUR9>3.0.CO;2-H. PMID 10871540.
  8. ^ Maths, Sangaku. "Absolute, relative, cumulative frequency and statistical tables – Probability and Statistics". www.sangakoo.com. Retrieved 2018-04-10.
  9. ^ a b "DATASUS: TabNet Win32 3.0: Nascidos vivos – Brasil". DATASUS: Tecnologia da Informação a Serviço do SUS.
  10. ^ a b c d Forthofer, Ronald N.; Lee, Eun Sul (1995). Introduction to Biostatistics. A Guide to Design, Analysis, and Discovery. Academic Press. ISBN 978-0-12-262270-0.
  11. ^ Pearson, Karl (1895-01-01). "X. Contributions to the mathematical theory of evolution.—II. Skew variation in homogeneous material". Phil. Trans. R. Soc. Lond. A. 186: 343–414. Bibcode:1895RSPTA.186..343P. doi:10.1098/rsta.1895.0010. ISSN 0264-3820.
  12. ^ Utts, Jessica M. (2005). Seeing through statistics (3rd ed.). Belmont, CA: Thomson, Brooks/Cole. ISBN 978-0534394028. OCLC 56568530.
  13. ^ Jarrell, Stephen B. (1994). Basic statistics. Dubuque, Iowa: Wm. C. Brown Pub. ISBN 978-0697215956. OCLC 30301196.
  14. ^ Gujarati, Damodar N. (2006). Econometrics. McGraw-Hill Irwin.
  15. ^ "Essentials of Biostatistics in Public Health & Essentials of Biostatistics Workbook: Statistical Computing Using Excel". Australian and New Zealand Journal of Public Health. 33 (2): 196–197. 2009. doi:10.1111/j.1753-6405.2009.00372.x. ISSN 1326-0200.
  16. ^ Baker, Monya (2016). "Statisticians issue warning over misuse of P values". Nature. 531 (7593): 151. Bibcode:2016Natur.531..151B. doi:10.1038/nature.2016.19503. PMID 26961635.
  17. ^ Benjamini, Y.; Hochberg, Y. (1995). "Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing". Journal of the Royal Statistical Society. Series B (Methodological). 57 (1): 289–300.
  18. ^ "Null hypothesis". www.statlect.com. Retrieved 2018-05-08.
  19. ^ Hayden, Erika Check (8 February 2012). "Biostatistics: Revealing analysis". Nature. 482 (7384): 263–265. doi:10.1038/nj7384-263a. PMID 22329008.
  20. ^ Efron, Bradley (February 2008). "Microarrays, Empirical Bayes and the Two-Groups Model". Statistical Science. 23 (1): 1–22. arXiv:0808.0572. doi:10.1214/07-STS236. S2CID 8417479.
  21. ^ Subramanian, A.; Tamayo, P.; Mootha, V. K.; Mukherjee, S.; Ebert, B. L.; Gillette, M. A.; Paulovich, A.; Pomeroy, S. L.; Golub, T. R.; Lander, E. S.; Mesirov, J. P. (30 September 2005). "Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles". Proceedings of the National Academy of Sciences. 102 (43): 15545–15550. Bibcode:2005PNAS..10215545S. doi:10.1073/pnas.0506580102. PMC 1239896. PMID 16199517.
  22. ^ a b c d e Moore, Jason H (2007). "Bioinformatics". Journal of Cellular Physiology. 213 (2): 365–9. doi:10.1002/jcp.21218. PMID 17654500. S2CID 221831488.
  23. ^ "TAIR - Home Page". www.arabidopsis.org.
  24. ^ "Phytozome". phytozome.jgi.doe.gov.
  25. ^ "International Nucleotide Sequence Database Collaboration - INSDC". www.insdc.org.
  26. ^ "Top". www.ddbj.nig.ac.jp.
  27. ^ "The European Bioinformatics Institute < EMBL-EBI". www.ebi.ac.uk.
  28. ^ "National Center for Biotechnology Information". www.ncbi.nlm.nih.gov. U.S. National Library of Medicine.
  29. ^ Apweiler, Rolf; et al. (2018). "Whither systems medicine?". Experimental & Molecular Medicine. 50 (3): e453. doi:10.1038/emm.2017.290. PMC 5898894. PMID 29497170.
  30. ^ Zeng, Zhao-Bang (2005). "QTL mapping and the genetic basis of adaptation: Recent developments". Genetica. 123 (1–2): 25–37. doi:10.1007/s10709-004-2705-0. PMID 15881678. S2CID 1094152.
  31. ^ Korte, Arthur; Farlow, Ashley (2013). "The advantages and limitations of trait analysis with GWAS: A review". Plant Methods. 9: 29. doi:10.1186/1746-4811-9-29. PMC 3750305. PMID 23876160.
  32. ^ Zhu, Chengsong; Gore, Michael; Buckler, Edward S; Yu, Jianming (2008). "Status and Prospects of Association Mapping in Plants". The Plant Genome. 1: 5–20. doi:10.3835/plantgenome2008.02.0089.
  33. ^ Crossa, José; Pérez-Rodríguez, Paulino; Cuevas, Jaime; Montesinos-López, Osval; Jarquín, Diego; De Los Campos, Gustavo; Burgueño, Juan; González-Camacho, Juan M; Pérez-Elizalde, Sergio; Beyene, Yoseph; Dreisigacker, Susanne; Singh, Ravi; Zhang, Xuecai; Gowda, Manje; Roorkiwal, Manish; Rutkoski, Jessica; Varshney, Rajeev K (2017). "Genomic Selection in Plant Breeding: Methods, Models, and Perspectives" (PDF). Trends in Plant Science. 22 (11): 961–975. doi:10.1016/j.tplants.2017.08.011. PMID 28965742. Archived (PDF) from the original on 2022-10-09.
  34. ^ Oshlack, Alicia; Robinson, Mark D; Young, Matthew D (2010). "From RNA-seq reads to differential expression results". Genome Biology. 11 (12): 220. doi:10.1186/gb-2010-11-12-220. PMC 3046478. PMID 21176179.
  35. ^ Helen Causton; John Quackenbush; Alvis Brazma (2003). Statistical Analysis of Gene Expression Microarray Data. Wiley-Blackwell.
  36. ^ Terry Speed (2003). Microarray Gene Expression Data Analysis: A Beginner's Guide. Chapman & Hall/CRC.
  37. ^ Frank Emmert-Streib; Matthias Dehmer (2010). Medical Biostatistics for Complex Diseases. Wiley-Blackwell. ISBN 978-3-527-32585-6.
  38. ^ Warren J. Ewens; Gregory R. Grant (2004). Statistical Methods in Bioinformatics: An Introduction. Springer.
  39. ^ Matthias Dehmer; Frank Emmert-Streib; Armin Graber; Armindo Salvador (2011). Applied Statistics for Network Biology: Methods in Systems Biology. Wiley-Blackwell. ISBN 978-3-527-32750-8.
  40. ^ a b "Home - VSN International". www.vsni.co.uk.
  41. ^ "CycDesigN - VSN International". www.vsni.co.uk.
  42. ^ Piepho, Hans-Peter; Williams, Emlyn R; Michel, Volker (2015). "Beyond Latin Squares: A Brief Tour of Row-Column Designs". Agronomy Journal. 107 (6): 2263. doi:10.2134/agronj15.0144.
  43. ^ "The Comprehensive R Archive Network". cran.r-project.org.
  44. ^ Renganathan V (2021). Biostatistics explored through R software: An overview. ISBN 9789354936586.
  45. ^ Stegmann, Dr Ralf (2019-07-01). "PLA 3.0". PLA 3.0 – Software for Biostatistical Analysis. Retrieved 2019-07-02.
  46. ^ "Biostatistics - Oxford Academic". OUP Academic.
  47. ^ "The International Journal of Biostatistics".
  48. ^ "PubMed Journals will be shut down". 15 June 2018.
  49. ^ "Epidemiology, Biostatistics and Public Health". ebph.it.
  50. ^ "Biometrics". onlinelibrary.wiley.com. doi:10.1111/(ISSN)1541-0420.
  51. ^ "Biometrika - Oxford Academic". OUP Academic.
  52. ^ "Biometrical Journal". onlinelibrary.wiley.com. doi:10.1002/(ISSN)1521-4036.
  53. ^ "Communications in Biometry and Crop Science". agrobiol.sggw.waw.pl.
  54. ^ "Statistical Applications in Genetics and Molecular Biology". www.degruyter.com. 1 May 2002.
  55. ^ "Statistical Methods in Medical Research". SAGE Journals.
  56. ^ "Pharmaceutical Statistics". onlinelibrary.wiley.com.
  57. ^ "Statistics in Medicine". onlinelibrary.wiley.com. doi:10.1002/(ISSN)1097-0258.

External links

  Media related to Biostatistics at Wikimedia Commons

  • The International Biometric Society
  • Guide to Biostatistics (MedPageToday.com)

biostatistics, biometry, redirects, here, automated, recognition, people, based, intrinsic, physical, behavioural, traits, biometrics, academic, journal, journal, also, known, biometry, development, application, statistical, methods, wide, range, topics, biolo. Biometry redirects here For the automated recognition of people based on intrinsic physical or behavioural traits see Biometrics For the academic journal see Biostatistics journal Biostatistics also known as biometry are the development and application of statistical methods to a wide range of topics in biology It encompasses the design of biological experiments the collection and analysis of data from those experiments and the interpretation of the results Contents 1 History 1 1 Biostatistics and genetics 2 Research planning 2 1 Research question 2 2 Hypothesis definition 2 3 Sampling 2 4 Experimental design 2 5 Data collection 3 Analysis and data interpretation 3 1 Descriptive tools 3 1 1 Frequency tables 3 1 2 Line graph 3 1 3 Bar chart 3 1 4 Histograms 3 1 5 Scatter plot 3 1 6 Mean 3 1 7 Median 3 1 8 Mode 3 1 9 Box plot 3 1 10 Correlation coefficients 3 1 11 Pearson correlation coefficient 3 2 Inferential statistics 4 Statistical considerations 4 1 Power and statistical error 4 2 p value 4 3 Multiple testing 4 4 Mis specification and robustness checks 4 5 Model selection criteria 5 Developments and big data 5 1 Use in high throughput data 5 2 Bioinformatics advances in databases data mining and biological interpretation 5 3 Use of computationally intensive methods 6 Applications 6 1 Public health 6 2 Quantitative genetics 6 3 Expression data 6 4 Other studies 7 Tools 8 Scope and training programs 9 Specialized journals 10 See also 11 References 12 External linksHistory EditBiostatistics and genetics Edit Biostatistical modeling forms an important part of numerous modern biological theories Genetics studies since its beginning used statistical concepts to understand observed experimental results Some 
genetics scientists even contributed with statistical advances with the development of methods and tools Gregor Mendel started the genetics studies investigating genetics segregation patterns in families of peas and used statistics to explain the collected data In the early 1900s after the rediscovery of Mendel s Mendelian inheritance work there were gaps in understanding between genetics and evolutionary Darwinism Francis Galton tried to expand Mendel s discoveries with human data and proposed a different model with fractions of the heredity coming from each ancestral composing an infinite series He called this the theory of Law of Ancestral Heredity His ideas were strongly disagreed by William Bateson who followed Mendel s conclusions that genetic inheritance were exclusively from the parents half from each of them This led to a vigorous debate between the biometricians who supported Galton s ideas as Raphael Weldon Arthur Dukinfield Darbishire and Karl Pearson and Mendelians who supported Bateson s and Mendel s ideas such as Charles Davenport and Wilhelm Johannsen Later biometricians could not reproduce Galton conclusions in different experiments and Mendel s ideas prevailed By the 1930s models built on statistical reasoning had helped to resolve these differences and to produce the neo Darwinian modern evolutionary synthesis Solving these differences also allowed to define the concept of population genetics and brought together genetics and evolution The three leading figures in the establishment of population genetics and this synthesis all relied on statistics and developed its use in biology Ronald Fisher worked alongside statistician Betty Allan developing several basic statistical methods in support of his work studying the crop experiments at Rothamsted Research published in Fisher s books Statistical Methods for Research Workers 1925 and The Genetical Theory of Natural Selection 1930 as well as Allan s scientific papers 1 Fisher went on to give many 
contributions to genetics and statistics Some of them include the ANOVA p value concepts Fisher s exact test and Fisher s equation for population dynamics He is credited for the sentence Natural selection is a mechanism for generating an exceedingly high degree of improbability 2 Sewall G Wright developed F statistics and methods of computing them and defined inbreeding coefficient J B S Haldane s book The Causes of Evolution reestablished natural selection as the premier mechanism of evolution by explaining it in terms of the mathematical consequences of Mendelian genetics He also developed the theory of primordial soup These and other biostatisticians mathematical biologists and statistically inclined geneticists helped bring together evolutionary biology and genetics into a consistent coherent whole that could begin to be quantitatively modeled In parallel to this overall development the pioneering work of D Arcy Thompson in On Growth and Form also helped to add quantitative discipline to biological study Despite the fundamental importance and frequent necessity of statistical reasoning there may nonetheless have been a tendency among biologists to distrust or deprecate results which are not qualitatively apparent One anecdote describes Thomas Hunt Morgan banning the Friden calculator from his department at Caltech saying Well I am like a guy who is prospecting for gold along the banks of the Sacramento River in 1849 With a little intelligence I can reach down and pick up big nuggets of gold And as long as I can do that I m not going to let any people in my department waste scarce resources in placer mining 3 Research planning EditAny research in life sciences is proposed to answer a scientific question we might have To answer this question with a high certainty we need accurate results The correct definition of the main hypothesis and the research plan will reduce errors while taking a decision in understanding a phenomenon The research plan might include the 
research question the hypothesis to be tested the experimental design data collection methods data analysis perspectives and costs involved It is essential to carry the study based on the three basic principles of experimental statistics randomization replication and local control Research question Edit The research question will define the objective of a study The research will be headed by the question so it needs to be concise at the same time it is focused on interesting and novel topics that may improve science and knowledge and that field To define the way to ask the scientific question an exhaustive literature review might be necessary So the research can be useful to add value to the scientific community 4 Hypothesis definition Edit Once the aim of the study is defined the possible answers to the research question can be proposed transforming this question into a hypothesis The main propose is called null hypothesis H0 and is usually based on a permanent knowledge about the topic or an obvious occurrence of the phenomena sustained by a deep literature review We can say it is the standard expected answer for the data under the situation in test In general HO assumes no association between treatments On the other hand the alternative hypothesis is the denial of HO It assumes some degree of association between the treatment and the outcome Although the hypothesis is sustained by question research and its expected and unexpected answers 4 As an example consider groups of similar animals mice for example under two different diet systems The research question would be what is the best diet In this case H0 would be that there is no difference between the two diets in mice metabolism H0 m1 m2 and the alternative hypothesis would be that the diets have different effects over animals metabolism H1 m1 m2 The hypothesis is defined by the researcher according to his her interests in answering the main question Besides that the alternative hypothesis can be more than one 
hypothesis It can assume not only differences across observed parameters but their degree of differences i e higher or shorter Sampling Edit Usually a study aims to understand an effect of a phenomenon over a population In biology a population is defined as all the individuals of a given species in a specific area at a given time In biostatistics this concept is extended to a variety of collections possible of study Although in biostatistics a population is not only the individuals but the total of one specific component of their organisms as the whole genome or all the sperm cells for animals or the total leaf area for a plant for example It is not possible to take the measures from all the elements of a population Because of that the sampling process is very important for statistical inference Sampling is defined as to randomly get a representative part of the entire population to make posterior inferences about the population So the sample might catch the most variability across a population 5 The sample size is determined by several things since the scope of the research to the resources available In clinical research the trial type as inferiority equivalence and superiority is a key in determining sample size 4 Experimental design Edit Experimental designs sustain those basic principles of experimental statistics There are three basic experimental designs to randomly allocate treatments in all plots of the experiment They are completely randomized design randomized block design and factorial designs Treatments can be arranged in many ways inside the experiment In agriculture the correct experimental design is the root of a good study and the arrangement of treatments within the study is essential because environment largely affects the plots plants livestock microorganisms These main arrangements can be found in the literature under the names of lattices incomplete blocks split plot augmented blocks and many others All of the designs might include control 
plots determined by the researcher to provide an error estimation during inference In clinical studies the samples are usually smaller than in other biological studies and in most cases the environment effect can be controlled or measured It is common to use randomized controlled clinical trials where results are usually compared with observational study designs such as case control or cohort 6 Data collection Edit Data collection methods must be considered in research planning because it highly influences the sample size and experimental design Data collection varies according to type of data For qualitative data collection can be done with structured questionnaires or by observation considering presence or intensity of disease using score criterion to categorize levels of occurrence 7 For quantitative data collection is done by measuring numerical information using instruments In agriculture and biology studies yield data and its components can be obtained by metric measures However pest and disease injuries in plats are obtained by observation considering score scales for levels of damage Especially in genetic studies modern methods for data collection in field and laboratory should be considered as high throughput platforms for phenotyping and genotyping These tools allow bigger experiments while turn possible evaluate many plots in lower time than a human based only method for data collection Finally all data collected of interest must be stored in an organized data frame for further analysis Analysis and data interpretation EditDescriptive tools Edit Main article Descriptive statistics Data can be represented through tables or graphical representation such as line charts bar charts histograms scatter plot Also measures of central tendency and variability can be very useful to describe an overview of the data Follow some examples Frequency tables Edit One type of tables are the frequency table which consists of data arranged in rows and columns where the 
frequency is the number of occurrences or repetitions of data Frequency can be 8 Absolute represents the number of times that a determined value appear N f 1 f 2 f 3 f n displaystyle N f 1 f 2 f 3 f n Relative obtained by the division of the absolute frequency by the total number n i f i N displaystyle n i frac f i N In the next example we have the number of genes in ten operons of the same organism Genes 2 3 3 4 5 3 3 3 3 4 Genes number Absolute frequency Relative frequency1 0 02 1 0 13 6 0 64 2 0 25 1 0 1Line graph Edit Figure A Line graph example The birth rate in Brazil 2010 2016 9 Figure B Bar chart example The birth rate in Brazil for the December months from 2010 to 2016 Figure C Example of Box Plot number of glycines in the proteome of eight different organisms A H Figure D Example of a scatter plot Line graphs represent the variation of a value over another metric such as time In general values are represented in the vertical axis while the time variation is represented in the horizontal axis 10 Bar chart Edit A bar chart is a graph that shows categorical data as bars presenting heights vertical bar or widths horizontal bar proportional to represent values Bar charts provide an image that could also be represented in a tabular format 10 In the bar chart example we have the birth rate in Brazil for the December months from 2010 to 2016 9 The sharp fall in December 2016 reflects the outbreak of Zika virus in the birth rate in Brazil Histograms Edit Example of a histogram The histogram or frequency distribution is a graphical representation of a dataset tabulated and divided into uniform or non uniform classes It was first introduced by Karl Pearson 11 Scatter plot Edit A scatter plot is a mathematical diagram that uses Cartesian coordinates to display values of a dataset A scatter plot shows the data as a set of points each one presenting the value of one variable determining the position on the horizontal axis and another variable on the vertical axis 12 
They are also called scatter graph scatter chart scattergram or scatter diagram 13 Mean Edit Main article Mean The arithmetic mean is the sum of a collection of values x 1 x 2 x 3 x n displaystyle x 1 x 2 x 3 cdots x n divided by the number of items of this collection n displaystyle n x 1 n i 1 n x i x 1 x 2 x n n displaystyle bar x frac 1 n left sum i 1 n x i right frac x 1 x 2 cdots x n n Median Edit Main article Median The median is the value in the middle of a dataset Mode Edit Main article Mode statistics The mode is the value of a set of data that appears most often 14 Comparison among mean median and mode Values 2 3 3 3 3 3 4 4 11 Type Example ResultMean 2 3 3 3 3 3 4 4 11 9 4Median 2 3 3 3 3 3 4 4 11 3Mode 2 3 3 3 3 3 4 4 11 3Box plot Edit Box plot is a method for graphically depicting groups of numerical data The maximum and minimum values are represented by the lines and the interquartile range IQR represent 25 75 of the data Outliers may be plotted as circles Correlation coefficients Edit Although correlations between two different kinds of data could be inferred by graphs such as scatter plot it is necessary validate this though numerical information For this reason correlation coefficients are required They provide a numerical value that reflects the strength of an association 10 Pearson correlation coefficient Edit Scatter diagram that demonstrates the Pearson correlation for different values of r Pearson correlation coefficient is a measure of association between two variables X and Y This coefficient usually represented by r rho for the population and r for the sample assumes values between 1 and 1 where r 1 represents a perfect positive correlation r 1 represents a perfect negative correlation and r 0 is no linear correlation 10 Inferential statistics Edit Main article Statistical inference It is used to make inferences 15 about an unknown population by estimation and or hypothesis testing In other words it is desirable to obtain parameters to 
describe the population of interest but since the data is limited it is necessary to make use of a representative sample in order to estimate them With that it is possible to test previously defined hypotheses and apply the conclusions to the entire population The standard error of the mean is a measure of variability that is crucial to do inferences 5 Hypothesis testingHypothesis testing is essential to make inferences about populations aiming to answer research questions as settled in Research planning section Authors defined four steps to be set 5 The hypothesis to be tested as stated earlier we have to work with the definition of a null hypothesis H0 that is going to be tested and an alternative hypothesis But they must be defined before the experiment implementation Significance level and decision rule A decision rule depends on the level of significance or in other words the acceptable error rate a It is easier to think that we define a critical value that determines the statistical significance when a test statistic is compared with it So a also has to be predefined before the experiment Experiment and statistical analysis This is when the experiment is really implemented following the appropriate experimental design data is collected and the more suitable statistical tests are evaluated Inference Is made when the null hypothesis is rejected or not rejected based on the evidence that the comparison of p values and a brings It is pointed that the failure to reject H0 just means that there is not enough evidence to support its rejection but not that this hypothesis is true Confidence intervalsA confidence interval is a range of values that can contain the true real parameter value in given a certain level of confidence The first step is to estimate the best unbiased estimate of the population parameter The upper value of the interval is obtained by the sum of this estimate with the multiplication between the standard error of the mean and the confidence level 
The calculation of lower value is similar but instead of a sum a subtraction must be applied 5 Statistical considerations EditPower and statistical error Edit When testing a hypothesis there are two types of statistic errors possible Type I error and Type II error The type I error or false positive is the incorrect rejection of a true null hypothesis and the type II error or false negative is the failure to reject a false null hypothesis The significance level denoted by a is the type I error rate and should be chosen before performing the test The type II error rate is denoted by b and statistical power of the test is 1 b p value Edit The p value is the probability of obtaining results as extreme as or more extreme than those observed assuming the null hypothesis H0 is true It is also called the calculated probability It is common to confuse the p value with the significance level a but the a is a predefined threshold for calling significant results If p is less than a the null hypothesis H0 is rejected 16 Multiple testing Edit In multiple tests of the same hypothesis the probability of the occurrence of falses positives familywise error rate increase and some strategy are used to control this occurrence This is commonly achieved by using a more stringent threshold to reject null hypotheses The Bonferroni correction defines an acceptable global significance level denoted by a and each test is individually compared with a value of a a m This ensures that the familywise error rate in all m tests is less than or equal to a When m is large the Bonferroni correction may be overly conservative An alternative to the Bonferroni correction is to control the false discovery rate FDR The FDR controls the expected proportion of the rejected null hypotheses the so called discoveries that are false incorrect rejections This procedure ensures that for independent tests the false discovery rate is at most q Thus the FDR is less conservative than the Bonferroni correction and have 
more power at the cost of more false positives 17 Mis specification and robustness checks Edit The main hypothesis being tested e g no association between treatments and outcomes is often accompanied by other technical assumptions e g about the form of the probability distribution of the outcomes that are also part of the null hypothesis When the technical assumptions are violated in practice then the null may be frequently rejected even if the main hypothesis is true Such rejections are said to be due to model mis specification 18 Verifying whether the outcome of a statistical test does not change when the technical assumptions are slightly altered so called robustness checks is the main way of combating mis specification Model selection criteria Edit Model criteria selection will select or model that more approximate true model The Akaike s Information Criterion AIC and The Bayesian Information Criterion BIC are examples of asymptotically efficient criteria Developments and big data EditThis section needs additional citations for verification Please help improve this article by adding citations to reliable sources Unsourced material may be challenged and removed December 2016 Learn how and when to remove this template message Recent developments have made a large impact on biostatistics Two important changes have been the ability to collect data on a high throughput scale and the ability to perform much more complex analysis using computational techniques This comes from the development in areas as sequencing technologies Bioinformatics and Machine learning Machine learning in bioinformatics Use in high throughput data Edit New biomedical technologies like microarrays next generation sequencers for genomics and mass spectrometry for proteomics generate enormous amounts of data allowing many tests to be performed simultaneously 19 Careful analysis with biostatistical methods is required to separate the signal from the noise For example a microarray could be used 
to measure many thousands of genes simultaneously, determining which of them show different expression in diseased cells compared to normal cells. However, only a fraction of genes will be differentially expressed.[20]

Multicollinearity often occurs in high-throughput biostatistical settings. Because of high intercorrelation between the predictors (such as gene expression levels), the information in one predictor may be contained in another. It could be that only 5% of the predictors are responsible for 90% of the variability of the response. In such a case, one could apply the biostatistical technique of dimension reduction, for example via principal component analysis.

Classical statistical techniques like linear or logistic regression and linear discriminant analysis do not work well for high-dimensional data, i.e., when the number of observations n is smaller than the number of features or predictors p (n < p). In fact, one can obtain quite high R² values despite very low predictive power of the statistical model. These classical techniques, especially least-squares linear regression, were developed for low-dimensional data, i.e., where the number of observations n is much larger than the number of predictors p (n >> p). In cases of high dimensionality, one should always use an independent validation test set and consider the residual sum of squares (RSS) and R² of the validation set, not those of the training set.

Often, it is useful to pool information from multiple predictors. For example, Gene Set Enrichment Analysis (GSEA) considers the perturbation of whole, functionally related gene sets rather than of single genes.[21] These gene sets might be known biochemical pathways or otherwise functionally related genes. The advantage of this approach is robustness: it is more likely that a single gene is falsely found to be perturbed than that a whole pathway is. Furthermore, one can integrate accumulated knowledge about
biochemical pathways (like the JAK-STAT signaling pathway) using this approach.

Bioinformatics advances in databases, data mining, and biological interpretation

The development of biological databases enables the storage and management of biological data, with the possibility of ensuring access for users around the world. They are useful for researchers depositing data, retrieving information and files (raw or processed) originating from other experiments, or indexing scientific articles, as in PubMed. Another possibility is to search for a desired term (a gene, a protein, a disease, an organism, and so on) and check all results related to the search. There are databases dedicated to SNPs (dbSNP), to knowledge on gene characterization and pathways (KEGG), and to descriptions of gene function, classified by cellular component, molecular function, and biological process (Gene Ontology).[22] In addition to databases that contain specific molecular information, there are others that are ample in the sense that they store information about an organism or group of organisms. As an example of a database directed at just one organism, but containing much data about it, there is the Arabidopsis thaliana genetic and molecular database, TAIR.[23] Phytozome,[24] in turn, stores the assemblies and annotation files of dozens of plant genomes, and also contains visualization and analysis tools. Moreover, there is interconnection between some databases for information exchange and sharing; a major initiative was the International Nucleotide Sequence Database Collaboration (INSDC),[25] which relates data from DDBJ,[26] EMBL-EBI,[27] and NCBI.[28]

Nowadays, the increase in the size and complexity of molecular datasets has led to the use of powerful statistical methods provided by computer-science algorithms developed in the machine learning field. Therefore, data mining and machine learning allow the detection of patterns in data with a complex structure, such as biological data, using methods of supervised and unsupervised learning, regression,
detection of clusters, and association rule mining, among others.[22] To indicate some of them: self-organizing maps and k-means are examples of cluster algorithms, while neural networks and support vector machine models are examples of common machine learning algorithms.

Collaborative work among molecular biologists, bioinformaticians, statisticians, and computer scientists is important to perform an experiment correctly, going from planning, through data generation and analysis, to the biological interpretation of the results.[22]

Use of computationally intensive methods

On the other hand, the advent of modern computer technology and relatively cheap computing resources has enabled computer-intensive biostatistical methods like bootstrapping and resampling methods. In recent times, random forests have gained popularity as a method for performing statistical classification. Random forest techniques generate a panel of decision trees. Decision trees have the advantage that you can draw and interpret them even with a basic understanding of mathematics and statistics. Random forests have thus been used for clinical decision support systems.[citation needed]

Applications

Public health

Public health, including epidemiology, health services research, nutrition, environmental health, and health care policy and management. In these medical contexts, it is important to consider the design and analysis of clinical trials. One example is the assessment of the severity state of a patient with a prognosis of the outcome of a disease. With new technologies and genetics knowledge, biostatistics is now also used for systems medicine, which consists of a more personalized medicine. For this, an integration of data from different sources is made, including conventional patient data, clinico-pathological parameters,
molecular and genetic data, as well as data generated by additional new omics technologies.[29]

Quantitative genetics

The study of population genetics and statistical genetics links variation in genotype with variation in phenotype. In other words, it is desirable to discover the genetic basis of a measurable trait, a quantitative trait, that is under polygenic control. A genome region that is responsible for a continuous trait is called a quantitative trait locus (QTL). The study of QTLs became feasible by using molecular markers and measuring traits in populations, but their mapping requires obtaining a population from an experimental crossing, like an F2 or recombinant inbred strains/lines (RILs). To scan for QTL regions in a genome, a gene map based on linkage must be built. Some of the best-known QTL mapping algorithms are Interval Mapping, Composite Interval Mapping, and Multiple Interval Mapping.[30]

However, QTL mapping resolution is impaired by the amount of recombination assayed, a problem for species in which it is difficult to obtain large offspring. Furthermore, allele diversity is restricted to individuals originating from contrasting parents, which limits studies of allele diversity when we have a panel of individuals representing a natural population.[31] For this reason, the genome-wide association study (GWAS) was proposed, which identifies QTLs based on linkage disequilibrium, that is, the non-random association between traits and molecular markers. It was leveraged by the development of high-throughput SNP genotyping.[32]

In animal and plant breeding, the use of markers in selection, mainly molecular markers, contributed to the development of marker-assisted selection. While QTL mapping is limited by resolution, GWAS does not have enough power to detect rare variants of small effect that are also influenced by the environment. So the concept of genomic selection (GS) arose, which uses all molecular markers in selection and allows the prediction of
the performance of candidates in the selection. The proposal is to genotype and phenotype a training population, then develop a model that can obtain the genomic estimated breeding values (GEBVs) of individuals belonging to a genotyped but not phenotyped population, called the testing population.[33] This kind of study can also include a validation population; following the concept of cross-validation, the real phenotype results measured in this population are compared with the phenotype results predicted by the model, which is used to check the model's accuracy.

In summary, some points about the application of quantitative genetics are:

It has been used in agriculture to improve crops (plant breeding) and livestock (animal breeding).
In biomedical research, this work can assist in finding candidate gene alleles that can cause or influence predisposition to diseases in human genetics.

Expression data

Studies of differential expression of genes from RNA-Seq data, as for RT-qPCR and microarrays, demand comparison of conditions. The goal is to identify genes that show a significant change in abundance between different conditions. Experiments are then designed appropriately, with replicates for each condition/treatment, randomization, and blocking when necessary. In RNA-Seq, the quantification of expression uses the information from mapped reads that are summarized in some genetic unit, such as exons that are part of a gene sequence. While microarray results can be approximated by a normal distribution, RNA-Seq count data are better explained by other distributions. The first distribution used was the Poisson, but it underestimates the sample error, leading to false positives. Currently, biological variation is accounted for by methods that estimate a dispersion parameter of a negative binomial distribution. Generalized linear models are used to perform the tests for statistical significance, and as the number of genes is high, multiple-test corrections have to be considered.[34] Some examples of
other analyses of genomics data come from microarray or proteomics experiments,[35][36] often concerning diseases or disease stages.[37]

Other studies

Ecology, ecological forecasting
Biological sequence analysis[38]
Systems biology for gene network inference or pathway analysis[39]
Clinical research and pharmaceutical development
Population dynamics, especially in regard to fisheries science
Phylogenetics and evolution
Pharmacodynamics
Pharmacokinetics
Neuroimaging

Tools

There are many tools that can be used for statistical analysis of biological data. Most of them are also useful in other areas of knowledge, covering a large number of applications. Brief descriptions of some of them follow, in alphabetical order:

ASReml: software developed by VSNi[40] that can also be used in the R environment as a package. It is designed to estimate variance components under a general linear mixed model using restricted maximum likelihood (REML). Models with fixed effects and random effects, nested or crossed, are allowed. It offers the possibility of investigating different variance-covariance matrix structures.
CycDesigN:[41] a computer package developed by VSNi[40] that helps researchers create experimental designs and analyze data coming from a design in one of the three classes handled by CycDesigN. These classes are resolvable, non-resolvable, partially replicated, and crossover designs. It includes less-used designs, such as the Latinized ones, e.g., the t-Latinized design.[42]
Orange: a programming interface for high-level data processing, data mining, and data visualization. It includes tools for gene expression and genomics.[22]
R: an open-source environment and programming language dedicated to statistical computing and graphics. It is an implementation of the S language maintained by CRAN.[43] In addition to its functions to read data tables, compute descriptive statistics, and develop and evaluate models, its repository contains packages developed by researchers around the world. This allows the development of functions
written to deal with the statistical analysis of data from specific applications.[44] In the case of bioinformatics, for example, there are packages located in the main repository (CRAN) and in others, such as Bioconductor. It is also possible to use packages under development that are shared on hosting services such as GitHub.
SAS: data analysis software widely used across universities, services, and industry. Developed by a company of the same name (SAS Institute), it uses the SAS language for programming.
PLA 3.0:[45] biostatistical analysis software for regulated environments (e.g., drug testing) that supports quantitative response assays (parallel-line, parallel-logistics, and slope-ratio) and dichotomous assays (quantal response, binary assays). It also supports weighting methods for combination calculations and the automatic aggregation of independent assay data.
Weka: Java software for machine learning and data mining, including tools and methods for visualization, clustering, regression, association rules, and classification. There are tools for cross-validation and bootstrapping, and a module for algorithm comparison. Weka can also be run in other programming languages, such as Perl or R.[22]
Python: a programming language used for image analysis, deep learning, and machine learning
SQL databases, NoSQL
NumPy (numerical Python), SciPy
SageMath
LAPACK (linear algebra)
MATLAB
Apache Hadoop, Apache Spark
Amazon Web Services

Scope and training programs

Almost all educational programmes in biostatistics are at the postgraduate level. They are most often found in schools of public health, affiliated with schools of medicine, forestry, or agriculture, or as a focus of application in departments of statistics. In the United States, where several universities have dedicated biostatistics departments, many other top-tier universities integrate biostatistics faculty into statistics or other departments, such as epidemiology. Thus, departments carrying the name "biostatistics" may exist under quite different structures. For instance, relatively
new biostatistics departments have been founded with a focus on bioinformatics and computational biology, whereas older departments, typically affiliated with schools of public health, have more traditional lines of research involving epidemiological studies and clinical trials as well as bioinformatics. In larger universities around the world, where both a statistics and a biostatistics department exist, the degree of integration between the two departments may range from the bare minimum to very close collaboration. In general, the difference between a statistics program and a biostatistics program is twofold: (i) statistics departments often host theoretical/methodological research that is less common in biostatistics programs, and (ii) statistics departments have lines of research that may include biomedical applications but also other areas such as industry (quality control), business, and economics, and biological areas other than medicine.

Specialized journals

See also: List of biostatistics journals

Biostatistics[46]
International Journal of Biostatistics[47]
Journal of Epidemiology and Biostatistics[48]
Biostatistics and Public Health[49]
Biometrics[50]
Biometrika[51]
Biometrical Journal[52]
Communications in Biometry and Crop Science[53]
Statistical Applications in Genetics and Molecular Biology[54]
Statistical Methods in Medical Research[55]
Pharmaceutical Statistics[56]
Statistics in Medicine[57]

See also

Bioinformatics
Epidemiological method
Epidemiology
Group size measures
Health indicator
Mathematical and theoretical biology

References

1. Centre for Transformative Innovation, Swinburne University of Technology. "Allan, Frances Elizabeth (Betty) – Person". Encyclopedia of Australian Science and Innovation. www.eoas.info. Retrieved 2022-10-26.
2. Gunter, Chris (10 December 2008). "Quantitative Genetics". Nature. 456 (7223): 719. Bibcode:2008Natur.456..719G. doi:10.1038/456719a. PMID 19079046.
3. Munger, Charles T. (2003-10-03). "Academic Economics: Strengths and Faults After Considering Interdisciplinary
Needs" (PDF). Archived (PDF) from the original on 2022-10-09.
4. Nizamuddin, Sarah L.; Nizamuddin, Junaid; Mueller, Ariel; Ramakrishna, Harish; Shahul, Sajid S. (October 2017). "Developing a Hypothesis and Statistical Planning". Journal of Cardiothoracic and Vascular Anesthesia. 31 (5): 1878–1882. doi:10.1053/j.jvca.2017.04.020. PMID 28778775.
5. Overholser, Brian R.; Sowinski, Kevin M. (2017). "Biostatistics Primer: Part I". Nutrition in Clinical Practice. 22 (6): 629–35. doi:10.1177/0115426507022006629. PMID 18042950.
6. Szczech, Lynda Anne; Coladonato, Joseph A.; Owen, William F. (4 October 2002). "Key Concepts in Biostatistics: Using Statistics to Answer the Question 'Is There a Difference?'". Seminars in Dialysis. 15 (5): 347–351. doi:10.1046/j.1525-139X.2002.00085.x. PMID 12358639.
7. Sandelowski, Margarete (2000). "Combining Qualitative and Quantitative Sampling, Data Collection, and Analysis Techniques in Mixed-Method Studies". Research in Nursing & Health. 23 (3): 246–255. doi:10.1002/1098-240X(200006)23:3<246::AID-NUR9>3.0.CO;2-H. PMID 10871540.
8. "Absolute, relative, cumulative frequency and statistical tables – Probability and Statistics". Sangaku Maths. www.sangakoo.com. Retrieved 2018-04-10.
9. "TabNet Win32 3.0: Nascidos vivos – Brasil". DATASUS: Tecnologia da Informação a Serviço do SUS.
10. Forthofer, Ronald N.; Lee, Eun Sul (1995). Introduction to Biostatistics: A Guide to Design, Analysis and Discovery. Academic Press. ISBN 978-0-12-262270-0.
11. Pearson, Karl (1895-01-01). "X. Contributions to the mathematical theory of evolution. II. Skew variation in homogeneous material". Phil. Trans. R. Soc. Lond. A. 186: 343–414. Bibcode:1895RSPTA.186..343P. doi:10.1098/rsta.1895.0010. ISSN 0264-3820.
12. Utts, Jessica M. (2005). Seeing Through Statistics (3rd ed.). Belmont, CA: Thomson, Brooks/Cole. ISBN 978-0534394028. OCLC 56568530.
13. Jarrell, Stephen B. (1994). Basic Statistics. Dubuque, Iowa: Wm. C. Brown Pub. ISBN 978-0697215956. OCLC 30301196.
14. Gujarati, Damodar N. (2006). Econometrics. McGraw-Hill Irwin.
15. "Essentials of Biostatistics in Public Health & Essentials of
Biostatistics Workbook: Statistical Computing Using Excel". Australian and New Zealand Journal of Public Health. 33 (2): 196–197. 2009. doi:10.1111/j.1753-6405.2009.00372.x. ISSN 1326-0200.
16. Baker, Monya (2016). "Statisticians issue warning over misuse of P values". Nature. 531 (7593): 151. Bibcode:2016Natur.531..151B. doi:10.1038/nature.2016.19503. PMID 26961635.
17. Benjamini, Y.; Hochberg, Y. (1995). "Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing". Journal of the Royal Statistical Society, Series B (Methodological). 57: 289–300.
18. "Null hypothesis". www.statlect.com. Retrieved 2018-05-08.
19. Hayden, Erika Check (8 February 2012). "Biostatistics: Revealing analysis". Nature. 482 (7384): 263–265. doi:10.1038/nj7384-263a. PMID 22329008.
20. Efron, Bradley (February 2008). "Microarrays, Empirical Bayes and the Two-Groups Model". Statistical Science. 23 (1): 1–22. arXiv:0808.0572. doi:10.1214/07-STS236.
21. Subramanian, A.; Tamayo, P.; Mootha, V. K.; Mukherjee, S.; Ebert, B. L.; Gillette, M. A.; et al. (30 September 2005). "Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles". Proceedings of the National Academy of Sciences. 102 (43): 15545–15550. Bibcode:2005PNAS..10215545S. doi:10.1073/pnas.0506580102. PMC 1239896. PMID 16199517.
22. Moore, Jason H. (2007). "Bioinformatics". Journal of Cellular Physiology. 213 (2): 365–9. doi:10.1002/jcp.21218. PMID 17654500.
23. "TAIR – Home Page". www.arabidopsis.org.
24. "Phytozome". phytozome.jgi.doe.gov.
25. "International Nucleotide Sequence Database Collaboration – INSDC". www.insdc.org.
26. "Top". www.ddbj.nig.ac.jp.
27. "The European Bioinformatics Institute < EMBL-EBI". www.ebi.ac.uk.
28. "National Center for Biotechnology Information". www.ncbi.nlm.nih.gov. U.S. National Library of Medicine.
29. Apweiler, Rolf; et al. (2018). "Whither systems medicine?". Experimental & Molecular Medicine. 50 (3): e453. doi:10.1038/emm.2017.290. PMC 5898894. PMID 29497170.
30. Zeng, Zhao-Bang (2005). "QTL mapping and the genetic basis of adaptation: Recent developments". Genetica. 123
(1–2): 25–37. doi:10.1007/s10709-004-2705-0. PMID 15881678.
31. Korte, Arthur; Farlow, Ashley (2013). "The advantages and limitations of trait analysis with GWAS: A review". Plant Methods. 9: 29. doi:10.1186/1746-4811-9-29. PMC 3750305. PMID 23876160.
32. Zhu, Chengsong; Gore, Michael; Buckler, Edward S.; Yu, Jianming (2008). "Status and Prospects of Association Mapping in Plants". The Plant Genome. 1: 5–20. doi:10.3835/plantgenome2008.02.0089.
33. Crossa, José; Pérez-Rodríguez, Paulino; Cuevas, Jaime; Montesinos-López, Osval; Jarquín, Diego; de los Campos, Gustavo; et al. (2017). "Genomic Selection in Plant Breeding: Methods, Models, and Perspectives" (PDF). Trends in Plant Science. 22 (11): 961–975. doi:10.1016/j.tplants.2017.08.011. PMID 28965742. Archived (PDF) from the original on 2022-10-09.
34. Oshlack, Alicia; Robinson, Mark D.; Young, Matthew D. (2010). "From RNA-seq reads to differential expression results". Genome Biology. 11 (12): 220. doi:10.1186/gb-2010-11-12-220. PMC 3046478. PMID 21176179.
35. Causton, Helen; Quackenbush, John; Brazma, Alvis (2003). Statistical Analysis of Gene Expression Microarray Data. Wiley-Blackwell.
36. Speed, Terry (2003). Microarray Gene Expression Data Analysis: A Beginner's Guide. Chapman & Hall/CRC.
37. Emmert-Streib, Frank; Dehmer, Matthias (2010). Medical Biostatistics for Complex Diseases. Wiley-Blackwell. ISBN 978-3-527-32585-6.
38. Ewens, Warren J.; Grant, Gregory R. (2004). Statistical Methods in Bioinformatics: An Introduction. Springer.
39. Dehmer, Matthias; Emmert-Streib, Frank; Graber, Armin; Salvador, Armindo (2011). Applied Statistics for Network Biology: Methods in Systems Biology. Wiley-Blackwell. ISBN 978-3-527-32750-8.
40. "Home – VSN International". www.vsni.co.uk.
41. "CycDesigN – VSN International". www.vsni.co.uk.
42. Piepho, Hans-Peter; Williams, Emlyn R.; Michel, Volker (2015). "Beyond Latin Squares: A Brief Tour of Row-Column Designs". Agronomy Journal. 107 (6): 2263. doi:10.2134/agronj15.0144.
43. "The Comprehensive R Archive
Network". cran.r-project.org.
44. Renganathan, V. (2021). Biostatistics Explored Through R Software: An Overview. ISBN 9789354936586.
45. Stegmann, Ralf (2019-07-01). "PLA 3.0 – Software for Biostatistical Analysis". PLA 3.0. Retrieved 2019-07-02.
46. "Biostatistics". Oxford Academic (OUP Academic).
47. "The International Journal of Biostatistics".
48. "PubMed Journals will be shut down". 15 June 2018.
49. "Epidemiology, Biostatistics and Public Health". ebph.it.
50. "Biometrics". onlinelibrary.wiley.com. ISSN 1541-0420.
51. "Biometrika". Oxford Academic (OUP Academic).
52. "Biometrical Journal". onlinelibrary.wiley.com. ISSN 1521-4036.
53. "Communications in Biometry and Crop Science". agrobiol.sggw.waw.pl.
54. "Statistical Applications in Genetics and Molecular Biology". www.degruyter.com. 1 May 2002.
55. "Statistical Methods in Medical Research". SAGE Journals.
56. "Pharmaceutical Statistics". onlinelibrary.wiley.com.
57. "Statistics in Medicine". onlinelibrary.wiley.com. ISSN 1097-0258.

External links

Media related to Biostatistics at Wikimedia Commons
The International Biometric Society
The Collection of Biostatistics Research Archive
Guide to Biostatistics (MedPageToday.com)
Biomedical Statistics
