Biostatistics - A Foundation for Analysis in the Health Sciences (10th Ed)-BIOE 440 - docshare.tips (2023)

TENTH EDITION
BIOSTATISTICS
A Foundation for Analysis
in the Health Sciences
WAYNE W. DANIEL, PH.D.
Professor Emeritus
Georgia State University
CHAD L. CROSS, PH.D., PSTAT®
Statistician
Office of Informatics and Analytics
Veterans Health Administration
Associate Graduate Faculty
University of Nevada, Las Vegas
This book was set in 10/12pt, Times Roman by Thomson Digital and printed and bound by Edwards Brothers Malloy.
The cover was printed by Edwards Brothers Malloy.
This book is printed on acid-free paper.
Founded in 1807, John Wiley & Sons, Inc. has been a valued source of knowledge and understanding for more
than 200 years, helping people around the world meet their needs and fulfill their aspirations. Our company is
built on a foundation of principles that include responsibility to the communities we serve and where we live and
work. In 2008, we launched a Corporate Citizenship Initiative, a global effort to address the environmental,
social, economic, and ethical challenges we face in our business. Among the issues we are addressing are carbon
impact, paper specifications and procurement, ethical conduct within our business and among our vendors, and
community and charitable support. For more information, please visit our website:
www.wiley.com/go/citizenship.
Copyright © 2013, 2009, 2005, 1999 John Wiley & Sons, Inc. All rights reserved. No part of this publication
may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic,
mechanical, photocopying, recording, scanning or otherwise, except as permitted under Sections 107 or 108 of
the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or
authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222
Rosewood Drive, Danvers, MA 01923, website www.copyright.com. Requests to the Publisher for permission
should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ
07030-5774, (201)748-6011, fax (201)748-6008, website http://www.wiley.com/go/permissions.
Evaluation copies are provided to qualified academics and professionals for review purposes only, for use in their
courses during the next academic year. These copies are licensed and may not be sold or transferred to a third
party. Upon completion of the review period, please return the evaluation copy to Wiley. Return instructions and
a free of charge return mailing label are available at www.wiley.com/go/returnlabel. If you have chosen to adopt
this textbook for use in your course, please accept this book as your complimentary desk copy. Outside of the
United States, please contact your local sales representative.
Library of Congress Cataloging-in-Publication Data
Daniel, Wayne W., 1929-
Biostatistics : a foundation for analysis in the health sciences / Wayne W.
Daniel, Chad Lee Cross. — Tenth edition.
pages cm
Includes index.
ISBN 978-1-118-30279-8 (cloth)
1. Medical statistics. 2. Biometry. I. Cross, Chad Lee, 1971- II. Title.
RA409.D35 2013
610.72'7—dc23 2012038459
Printed in the United States of America
10 9 8 7 6 5 4 3 2 1
VP & EXECUTIVE PUBLISHER: Laurie Rosatone
ACQUISITIONS EDITOR: Shannon Corliss
PROJECT EDITOR: Ellen Keohane
MARKETING MANAGER: Melanie Kurkjian
MARKETING ASSISTANT: Patrick Flatley
PHOTO EDITOR: Sheena Goldstein
DESIGNER: Kenji Ngieng
PRODUCTION MANAGEMENT SERVICES: Thomson Digital
ASSOCIATE PRODUCTION MANAGER: Joyce Poh
PRODUCTION EDITOR: Jolene Ling
COVER PHOTO CREDIT: ©ktsimage/iStockphoto
Dr. Daniel
To my children, Jean, Carolyn,
and John, and to the memory of
their mother, my wife, Mary.
Dr. Cross
To my wife Pamela
and to my children, Annabella Grace
and Breanna Faith.
PREFACE
This 10th edition of Biostatistics: A Foundation for Analysis in the Health Sciences was
prepared with the objective of appealing to a wide audience. Previous editions of the book
have been used by the authors and their colleagues in a variety of contexts. For
undergraduates, this edition should provide an introduction to statistical concepts for students in
the biosciences, health sciences, and for mathematics majors desiring exposure to applied
statistical concepts. Like its predecessors, this edition is designed to meet the needs of
beginning graduate students in various fields such as nursing, applied sciences, and public
health who are seeking a strong foundation in quantitative methods. For professionals
already working in the health field, this edition can serve as a useful desk reference.
The breadth of coverage provided in this text, along with the hundreds of practical
exercises, allows instructors extensive flexibility in designing courses at many levels. To
that end, we offer below some ideas on topical coverage that we have found to be useful in
the classroom setting.
Like the previous editions of this book, this edition requires few mathematical
prerequisites beyond a solid proficiency in college algebra. We have maintained an emphasis
on practical and intuitive understanding of principles rather than on abstract concepts that
underlie some methods, and that require greater mathematical sophistication. With that in
mind, we have maintained a reliance on problem sets and examples taken directly from the
health sciences literature instead of contrived examples. We believe that this makes the text
more interesting for students, and more practical for practicing health professionals who
reference the text while performing their work duties.
For most of the examples and statistical techniques covered in this edition, we
discuss the use of computer software for calculations. Experience has informed our
decision to include example printouts from a variety of statistical software in this edition
(e.g., MINITAB, SAS, SPSS, and R). We feel that the inclusion of examples from these
particular packages, which are generally the most commonly utilized by practitioners,
provides a rich presentation of the material and allows the student the opportunity to
appreciate the various technologies used by practicing statisticians.
CHANGES AND UPDATES TO THIS EDITION
The majority of the chapters include corrections and clarifications that enhance the material
that is presented and make it more readable and accessible to the audience. We did,
however, make several specific changes and improvements that we believe are valuable
contributions to this edition, and we thank the reviewers of the previous edition for their
comments and suggestions in that regard.
Specific changes to this edition include additional text concerning measures of
dispersion in Chapter 2, additional text and examples using program R in Chapter 6, a new
introduction to linear models in Chapter 8 that ties together the regression and ANOVA
concepts in Chapters 8–11, the addition of two-factor repeated measures ANOVA in
Chapter 8, a discussion of the similarities of ANOVA and regression in Chapter 11,
and extensive new text and examples on testing the fit of logistic regression models in
Chapter 11.
Most important to this new edition is a new Chapter 14 on Survival Analysis. This
new chapter was born out of requests from reviewers of the text and from the experience
of the authors in terms of the growing use of these methods in applied research. In this
new chapter, we included some of the material found in Chapter 12 in previous editions,
and added extensive material and examples. We provide introductory coverage of
censoring, Kaplan–Meier estimates, methods for comparing survival curves, and the
Cox Regression Proportional Hazards model. Owing to this new material, we elected
to move the contents of the vital statistics chapter to a new Chapter 15 and make it
available online (www.wiley.com/college/daniel).
COURSE COVERAGE IDEAS
In the table below we provide some suggestions for topical coverage in a variety of
contexts, with “X” indicating those chapters we believe are most relevant for a variety of
courses for which this text is appropriate. The text has been designed to be flexible in order
to accommodate various teaching styles and various course presentations. Although the
text is designed with progressive presentation of concepts in mind, certain of the topics may
be skipped or covered briefly so that focus can be placed on concepts important to
instructors.
Course                                              Chapters
                                                     1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
Undergraduate course for health sciences students    X  X  X  X  X  X  X  X  X  O  O  X  O  O  O
Undergraduate course in applied statistics for       X  O  O  O  X  X  X  X  X  X  O  X  X  X  O
  mathematics majors
First biostatistics course for beginning graduate    X  X  X  X  X  X  X  X  X  X  O  X  X  X  O
  students
Biostatistics course for graduate health sciences    X  O  O  O  O  X  X  X  X  X  X  X  X  X  X
  students who have completed an introductory
  statistics course
X: Suggested coverage; O: Optional coverage.
SUPPLEMENTS
Instructor’s Solutions Manual. Prepared by Dr. Chad Cross, this manual includes
solutions to all problems found in the text. This manual is available only to instructors
who have adopted the text.
Student Solutions Manual. Prepared by Dr. Chad Cross, this manual includes solutions
to all odd-numbered exercises. This manual may be packaged with the text at a discounted
price.
Data Sets. More than 250 data sets are available online to accompany the text. These data
sets include those data presented in examples, exercises, review exercises, and the large
data sets found in some chapters. These are available in SAS, SPSS, and Minitab formats
as well as CSV format for importing into other programs. Data are available for
downloading at www.wiley.com/college/daniel
Those without Internet access may contact Wiley directly at 111 River Street, Hoboken, NJ
07030-5774; telephone: 1-877-762-2974.
ACKNOWLEDGMENTS
Many reviewers, students, and faculty have made contributions to this text through their
careful review, inquisitive questions, and professional discussion of topics. In particular,
we would like to thank Dr. Sheniz Moonie of the University of Nevada, Las Vegas; Dr. Roy
T. Sabo of Virginia Commonwealth University; and Dr. Derek Webb of Bemidji State
University for their useful comments on the ninth edition of this text.
There are three additional important acknowledgments that must be made to
important contributors of the text. Dr. John P. Holcomb of Cleveland State University
updated many of the examples and exercises found in the text. Dr. Edward Danial of
Morgan State University provided an extensive accuracy review of the ninth edition of the
text, and his valuable comments added greatly to the book. Dr. Jodi B. A. McKibben of the
Uniformed Services University of the Health Sciences provided an extensive accuracy
review of the current edition of the book.
We wish to acknowledge the cooperation of Minitab, Inc. for making available
to the authors over many years and editions of the book the latest versions of their
software.
Thanks are due to Professors Geoffrey Churchill and Brian Schott of Georgia State
University who wrote computer programs for generating some of the Appendix tables,
and to Professor Lillian Lin, who read and commented on the logistic regression material
in earlier editions of the book. Additionally, Dr. James T. Wassell provided useful
assistance with some of the survival analysis methods presented in earlier editions of
the text.
We are grateful to the many researchers in the health sciences field who publish their
results and hence make available data that provide valuable practice to the students of
biostatistics.
Wayne W. Daniel
Chad L. Cross*
*The views presented in this book are those of the author and do not necessarily represent the views of the U.S.
Department of Veterans Affairs.
BRIEF CONTENTS
1 INTRODUCTION TO BIOSTATISTICS
2 DESCRIPTIVE STATISTICS
3 SOME BASIC PROBABILITY CONCEPTS
4 PROBABILITY DISTRIBUTIONS
5 SOME IMPORTANT SAMPLING DISTRIBUTIONS
6 ESTIMATION
7 HYPOTHESIS TESTING
8 ANALYSIS OF VARIANCE
9 SIMPLE LINEAR REGRESSION AND CORRELATION
10 MULTIPLE REGRESSION AND CORRELATION
11 REGRESSION ANALYSIS: SOME ADDITIONAL TECHNIQUES
12 THE CHI-SQUARE DISTRIBUTION AND THE ANALYSIS OF FREQUENCIES
13 NONPARAMETRIC AND DISTRIBUTION-FREE STATISTICS
14 SURVIVAL ANALYSIS
15 VITAL STATISTICS (ONLINE)
APPENDIX: STATISTICAL TABLES
ANSWERS TO ODD-NUMBERED EXERCISES
INDEX
CONTENTS
1 INTRODUCTION TO BIOSTATISTICS
1.1 Introduction
1.2 Some Basic Concepts
1.3 Measurement and Measurement Scales
1.4 Sampling and Statistical Inference
1.5 The Scientific Method and the Design of Experiments
1.6 Computers and Biostatistical Analysis
1.7 Summary
Review Questions and Exercises
References
2 DESCRIPTIVE STATISTICS
2.1 Introduction
2.2 The Ordered Array
2.3 Grouped Data: The Frequency Distribution
2.4 Descriptive Statistics: Measures of Central Tendency
2.5 Descriptive Statistics: Measures of Dispersion
2.6 Summary
Review Questions and Exercises
References
3 SOME BASIC PROBABILITY CONCEPTS
3.1 Introduction
3.2 Two Views of Probability: Objective and Subjective
3.3 Elementary Properties of Probability
3.4 Calculating the Probability of an Event
3.5 Bayes’ Theorem, Screening Tests, Sensitivity, Specificity, and Predictive Value Positive and Negative
3.6 Summary
Review Questions and Exercises
References
4 PROBABILITY DISTRIBUTIONS
4.1 Introduction
4.2 Probability Distributions of Discrete Variables
4.3 The Binomial Distribution
4.4 The Poisson Distribution
4.5 Continuous Probability Distributions
4.6 The Normal Distribution
4.7 Normal Distribution Applications
4.8 Summary
Review Questions and Exercises
References
5 SOME IMPORTANT SAMPLING DISTRIBUTIONS
5.1 Introduction
5.2 Sampling Distributions
5.3 Distribution of the Sample Mean
5.4 Distribution of the Difference Between Two Sample Means
5.5 Distribution of the Sample Proportion
5.6 Distribution of the Difference Between Two Sample Proportions
5.7 Summary
Review Questions and Exercises
References
6 ESTIMATION
6.1 Introduction
6.2 Confidence Interval for a Population Mean
6.3 The t Distribution
6.4 Confidence Interval for the Difference Between Two Population Means
6.5 Confidence Interval for a Population Proportion
6.6 Confidence Interval for the Difference Between Two Population Proportions
6.7 Determination of Sample Size for Estimating Means
6.8 Determination of Sample Size for Estimating Proportions
6.9 Confidence Interval for the Variance of a Normally Distributed Population
6.10 Confidence Interval for the Ratio of the Variances of Two Normally Distributed Populations
6.11 Summary
Review Questions and Exercises
References
7 HYPOTHESIS TESTING
7.1 Introduction
7.2 Hypothesis Testing: A Single Population Mean
7.3 Hypothesis Testing: The Difference Between Two Population Means
7.4 Paired Comparisons
7.5 Hypothesis Testing: A Single Population Proportion
7.6 Hypothesis Testing: The Difference Between Two Population Proportions
7.7 Hypothesis Testing: A Single Population Variance
7.8 Hypothesis Testing: The Ratio of Two Population Variances
7.9 The Type II Error and the Power of a Test
7.10 Determining Sample Size to Control Type II Errors
7.11 Summary
Review Questions and Exercises
References
8 ANALYSIS OF VARIANCE
8.1 Introduction
8.2 The Completely Randomized Design
8.3 The Randomized Complete Block Design
8.4 The Repeated Measures Design
8.5 The Factorial Experiment
8.6 Summary
Review Questions and Exercises
References
9 SIMPLE LINEAR REGRESSION AND CORRELATION
9.1 Introduction
9.2 The Regression Model
9.3 The Sample Regression Equation
9.4 Evaluating the Regression Equation
9.5 Using the Regression Equation
9.6 The Correlation Model
9.7 The Correlation Coefficient
9.8 Some Precautions
9.9 Summary
Review Questions and Exercises
References
10 MULTIPLE REGRESSION AND CORRELATION
10.1 Introduction
10.2 The Multiple Linear Regression Model
10.3 Obtaining the Multiple Regression Equation
10.4 Evaluating the Multiple Regression Equation
10.5 Using the Multiple Regression Equation
10.6 The Multiple Correlation Model
10.7 Summary
Review Questions and Exercises
References
11 REGRESSION ANALYSIS: SOME ADDITIONAL TECHNIQUES
11.1 Introduction
11.2 Qualitative Independent Variables
11.3 Variable Selection Procedures
11.4 Logistic Regression
11.5 Summary
Review Questions and Exercises
References
12 THE CHI-SQUARE DISTRIBUTION AND THE ANALYSIS OF FREQUENCIES
12.1 Introduction
12.2 The Mathematical Properties of the Chi-Square Distribution
12.3 Tests of Goodness-of-Fit
12.4 Tests of Independence
12.5 Tests of Homogeneity
12.6 The Fisher Exact Test
12.7 Relative Risk, Odds Ratio, and the Mantel–Haenszel Statistic
12.8 Summary
Review Questions and Exercises
References
13 NONPARAMETRIC AND DISTRIBUTION-FREE STATISTICS
13.1 Introduction
13.2 Measurement Scales
13.3 The Sign Test
13.4 The Wilcoxon Signed-Rank Test for Location
13.5 The Median Test
13.6 The Mann–Whitney Test
13.7 The Kolmogorov–Smirnov Goodness-of-Fit Test
13.8 The Kruskal–Wallis One-Way Analysis of Variance by Ranks
13.9 The Friedman Two-Way Analysis of Variance by Ranks
13.10 The Spearman Rank Correlation Coefficient
13.11 Nonparametric Regression Analysis
13.12 Summary
Review Questions and Exercises
References
14 SURVIVAL ANALYSIS
14.1 Introduction
14.2 Time-to-Event Data and Censoring
14.3 The Kaplan–Meier Procedure
14.4 Comparing Survival Curves
14.5 Cox Regression: The Proportional Hazards Model
14.6 Summary
Review Questions and Exercises
References
15 VITAL STATISTICS (ONLINE)
www.wiley.com/college/daniel
15.1 Introduction
15.2 Death Rates and Ratios
15.3 Measures of Fertility
15.4 Measures of Morbidity
15.5 Summary
Review Questions and Exercises
References
APPENDIX: STATISTICAL TABLES
ANSWERS TO ODD-NUMBERED EXERCISES
INDEX
CHAPTER 1
INTRODUCTION TO
BIOSTATISTICS
CHAPTER OVERVIEW
This chapter is intended to provide an overview of the basic statistical
concepts used throughout the textbook. A course in statistics requires the
student to learn many new terms and concepts. This chapter lays the
foundation necessary for understanding basic statistical terms and concepts and the
role that statisticians play in promoting scientific discovery and wisdom.
TOPICS
1.1 INTRODUCTION
1.2 SOME BASIC CONCEPTS
1.3 MEASUREMENT AND MEASUREMENT SCALES
1.4 SAMPLING AND STATISTICAL INFERENCE
1.5 THE SCIENTIFIC METHOD AND THE DESIGN OF EXPERIMENTS
1.6 COMPUTERS AND BIOSTATISTICAL ANALYSIS
1.7 SUMMARY
LEARNING OUTCOMES
After studying this chapter, the student will
1. understand the basic concepts and terminology of biostatistics, including the
various kinds of variables, measurement, and measurement scales.
2. be able to select a simple random sample and other scientific samples from a
population of subjects.
3. understand the processes involved in the scientific method and the design of
experiments.
4. appreciate the advantages of using computers in the statistical analysis of data
generated by studies and experiments conducted by researchers in the health
sciences.
1.1 INTRODUCTION
We are frequently reminded of the fact that we are living in the information age.
Appropriately, then, this book is about information—how it is obtained, how it is analyzed,
and how it is interpreted. The information about which we are concerned we call data, and
the data are available to us in the form of numbers.
The objectives of this book are twofold: (1) to teach the student to organize and
summarize data, and (2) to teach the student how to reach decisions about a large body of
data by examining only a small part of it. The concepts and methods necessary for
achieving the first objective are presented under the heading of descriptive statistics, and
the second objective is reached through the study of what is called inferential statistics.
Chapter 2 discusses descriptive statistics. Chapters 3 through 5 discuss topics that form
the foundation of statistical inference, and most of the remainder of the book deals with
inferential statistics.
Because this volume is designed for persons preparing for or already pursuing a
career in the health field, the illustrative material and exercises reflect the problems and
activities that these persons are likely to encounter in the performance of their duties.
1.2 SOME BASIC CONCEPTS
Like all fields of learning, statistics has its own vocabulary. Some of the words and phrases
encountered in the study of statistics will be new to those not previously exposed to the
subject. Other terms, though appearing to be familiar, may have specialized meanings that
are different from the meanings that we are accustomed to associating with these terms.
The following are some terms that we will use extensively in this book.
Data The raw material of statistics is data. For our purposes we may define data as
numbers. The two kinds of numbers that we use in statistics are numbers that result from
the taking—in the usual sense of the term—of a measurement, and those that result
from the process of counting. For example, when a nurse weighs a patient or takes
a patient’s temperature, a measurement, consisting of a number such as 150 pounds or
100 degrees Fahrenheit, is obtained. Quite a different type of number is obtained when a
hospital administrator counts the number of patients—perhaps 20—discharged from the
hospital on a given day. Each of the three numbers is a datum, and the three taken
together are data.
Statistics The meaning of statistics is implicit in the previous section. More
concretely, however, we may say that statistics is a field of study concerned with (1)
the collection, organization, summarization, and analysis of data; and (2) the drawing of
inferences about a body of data when only a part of the data is observed.
The person who performs these statistical activities must be prepared to interpret and
to communicate the results to someone else as the situation demands. Simply put, we may
say that data are numbers, numbers contain information, and the purpose of statistics is to
investigate and evaluate the nature and meaning of this information.
Sources of Data The performance of statistical activities is motivated by the
need to answer a question. For example, clinicians may want answers to questions
regarding the relative merits of competing treatment procedures. Administrators may
want answers to questions regarding such areas of concern as employee morale or
facility utilization. When we determine that the appropriate approach to seeking an
answer to a question will require the use of statistics, we begin to search for suitable data
to serve as the raw material for our investigation. Such data are usually available from
one or more of the following sources:
1. Routinely kept records. It is difficult to imagine any type of organization that
does not keep records of day-to-day transactions of its activities. Hospital medical
records, for example, contain immense amounts of information on patients, while
hospital accounting records contain a wealth of data on the facility’s business
activities. When the need for data arises, we should look for them first among
routinely kept records.
2. Surveys. If the data needed to answer a question are not available from routinely
kept records, the logical source may be a survey. Suppose, for example, that the
administrator of a clinic wishes to obtain information regarding the mode of
transportation used by patients to visit the clinic. If admission forms do not contain
a question on mode of transportation, we may conduct a survey among patients to
obtain this information.
3. Experiments. Frequently the data needed to answer a question are available only as
the result of an experiment. A nurse may wish to know which of several strategies is
best for maximizing patient compliance. The nurse might conduct an experiment in
which the different strategies of motivating compliance are tried with different
patients. Subsequent evaluation of the responses to the different strategies might
enable the nurse to decide which is most effective.
4. External sources. The data needed to answer a question may already exist in the
form of published reports, commercially available data banks, or the research
literature. In other words, we may find that someone else has already asked the
same question, and the answer obtained may be applicable to our present
situation.
Biostatistics The tools of statistics are employed in many fields—business,
education, psychology, agriculture, and economics, to mention only a few. When the
data analyzed are derived from the biological sciences and medicine, we use the term
biostatistics to distinguish this particular application of statistical tools and concepts. This
area of application is the concern of this book.
Variable If, as we observe a characteristic, we find that it takes on different values
in different persons, places, or things, we label the characteristic a variable. We do this
for the simple reason that the characteristic is not the same when observed in different
possessors of it. Some examples of variables include diastolic blood pressure, heart rate,
the heights of adult males, the weights of preschool children, and the ages of patients
seen in a dental clinic.
Quantitative Variables A quantitative variable is one that can be measured in
the usual sense. We can, for example, obtain measurements on the heights of adult males,
the weights of preschool children, and the ages of patients seen in a dental clinic. These are
examples of quantitative variables. Measurements made on quantitative variables convey
information regarding amount.
Qualitative Variables Some characteristics are not capable of being measured
in the sense that height, weight, and age are measured. Many characteristics can be
categorized only, as, for example, when an ill person is given a medical diagnosis, a
person is designated as belonging to an ethnic group, or a person, place, or object is
said to possess or not to possess some characteristic of interest. In such cases
measuring consists of categorizing. We refer to variables of this kind as qualitative
variables. Measurements made on qualitative variables convey information regarding
attribute.
Although, in the case of qualitative variables, measurement in the usual sense of the
word is not achieved, we can count the number of persons, places, or things belonging to
various categories. A hospital administrator, for example, can count the number of patients
admitted during a day under each of the various admitting diagnoses. These counts, or
frequencies as they are called, are the numbers that we manipulate when our analysis
involves qualitative variables.
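To make the idea of frequencies concrete, the following is a minimal sketch in Python of how counts might be tallied for a qualitative variable. The list of admitting diagnoses and the function name `frequency_table` are hypothetical and serve only as an illustration; they are not data or procedures from this book.

```python
from collections import Counter

# Hypothetical admitting diagnoses recorded for one day (illustrative values only).
admitting_diagnoses = [
    "pneumonia", "fracture", "pneumonia", "asthma", "fracture", "pneumonia",
]


def frequency_table(observations):
    """Return a dict mapping each category to the number of times it occurs."""
    return dict(Counter(observations))


frequencies = frequency_table(admitting_diagnoses)

# Print one line per category, in alphabetical order.
for diagnosis, count in sorted(frequencies.items()):
    print(f"{diagnosis}: {count}")
```

The categories themselves carry no numerical meaning; only the resulting counts are numbers that can be analyzed further.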
Random Variable Whenever we determine the height, weight, or age of an
individual, the result is frequently referred to as a value of the respective variable.
When the values obtained arise as a result of chance factors, so that they cannot be
exactly predicted in advance, the variable is called a random variable. An example of a
random variable is adult height. When a child is born, we cannot predict exactly his or her
height at maturity. Attained adult height is the result of numerous genetic and
environmental factors. Values resulting from measurement procedures are often referred to as
observations or measurements.
Discrete Random Variable Variables may be characterized further as to
whether they are discrete or continuous. Since mathematically rigorous definitions of
discrete and continuous variables are beyond the level of this book, we offer, instead,
nonrigorous definitions and give an example of each.
A discrete variable is characterized by gaps or interruptions in the values that it can
assume. These gaps or interruptions indicate the absence of values between particular
values that the variable can assume. Some examples illustrate the point. The number of
daily admissions to a general hospital is a discrete random variable since the number of
admissions each day must be represented by a whole number, such as 0, 1, 2, or 3. The
number of admissions on a given day cannot be a number such as 1.5, 2.997, or 3.333. The
number of decayed, missing, or filled teeth per child in an elementary school is another
example of a discrete variable.
Continuous Random Variable A continuous random variable does not
possess the gaps or interruptions characteristic of a discrete random variable. A
continuous random variable can assume any value within a specified relevant interval
of values assumed by the variable. Examples of continuous variables include the various
measurements that can be made on individuals such as height, weight, and skull
circumference. No matter how close together the observed heights of two people, for
example, we can, theoretically, find another person whose height falls somewhere in
between.
Because of the limitations of available measuring instruments, however,
observations on variables that are inherently continuous are recorded as if they were discrete.
Height, for example, is usually recorded to the nearest one-quarter, one-half, or whole
inch, whereas, with a perfect measuring device, such a measurement could be made as
precise as desired.
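The effect of a measuring instrument's limited resolution can be sketched in a few lines of Python. The heights below and the helper `record_to_nearest` are assumptions made for illustration; the sketch simply rounds each value to the nearest multiple of a chosen resolution, here one-quarter inch.

```python
def record_to_nearest(value, resolution):
    """Round a measurement to the nearest multiple of `resolution`."""
    return round(value / resolution) * resolution


# Hypothetical "true" heights in inches, as a perfect device might report them.
true_heights = [67.3821, 67.4419, 70.1187]

# The same heights as they would be recorded to the nearest one-quarter inch.
recorded_heights = [record_to_nearest(h, 0.25) for h in true_heights]

print(recorded_heights)
```

The first two heights differ, yet both are recorded as 67.5 inches, so the recorded values show the gaps typical of a discrete variable even though height itself is continuous.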
Population The average person thinks of a population as a collection of entities,
usually people. A population or collection of entities may, however, consist of animals,
machines, places, or cells. For our purposes, we define a population of entities as the
largest collection of entities for which we have an interest at a particular time. If we take a
measurement of some variable on each of the entities in a population, we generate a
population of values of that variable. We may, therefore, define a population of values as
the largest collection of values of a random variable for which we have an interest at a
particular time. If, for example, we are interested in the weights of all the children enrolled
in a certain county elementary school system, our population consists of all these weights.
If our interest lies only in the weights of first-grade students in the system, we have a
different population—weights of first-grade students enrolled in the school system. Hence,
populations are determined or defined by our sphere of interest. Populations may be finite
or infinite. If a population of values consists of a fixed number of these values, the
population is said to be finite. If, on the other hand, a population consists of an endless
succession of values, the population is an infinite one.
Sample A sample may be defined simply as a part of a population. Suppose our
population consists of the weights of all the elementary school children enrolled in a certain
county school system. If we collect for analysis the weights of only a fraction of these
children, we have only a part of our population of weights, that is, we have a sample.
1.3 MEASUREMENT AND
MEASUREMENT SCALES
In the preceding discussion we used the word measurement several times in its usual sense,
and presumably the reader clearly understood the intended meaning. The word measurement,
however, may be given a more scientific definition. In fact, there is a whole body of
scientific literature devoted to the subject of measurement. Part of this literature is
concerned also with the nature of the numbers that result from measurements. Authorities
on the subject of measurement speak of measurement scales that result in the categorization
of measurements according to their nature. In this section we define measurement and
the four resulting measurement scales. A more detailed discussion of the subject is to be
found in the writings of Stevens (1,2).
Measurement This may be defined as the assignment of numbers to objects or
events according to a set of rules. The various measurement scales result from the fact that
measurement may be carried out under different sets of rules.
The Nominal Scale The lowest measurement scale is the nominal scale. As the
name implies it consists of “naming” observations or classifying them into various
mutually exclusive and collectively exhaustive categories. The practice of using numbers
to distinguish among the various medical diagnoses constitutes measurement on a nominal
scale. Other examples include such dichotomies as male–female, well–sick, under 65 years
of age–65 and over, child–adult, and married–not married.
The Ordinal Scale Whenever observations are not only different from category to
category but can be ranked according to some criterion, they are said to be measured on an
ordinal scale. Convalescing patients may be characterized as unimproved, improved, and
much improved. Individuals may be classified according to socioeconomic status as low,
medium, or high. The intelligence of children may be above average, average, or below
average. In each of these examples the members of any one category are all considered
equal, but the members of one category are considered lower, worse, or smaller than those
in another category, which in turn bears a similar relationship to another category. For
example, a much improved patient is in better health than one classified as improved, while
a patient who has improved is in better condition than one who has not improved. It is
usually impossible to infer that the difference between members of one category and the
next adjacent category is equal to the difference between members of that category and the
members of the next category adjacent to it. The degree of improvement between
unimproved and improved is probably not the same as that between improved and
much improved. The implication is that if a finer breakdown were made resulting in
more categories, these, too, could be ordered in a similar manner. The function of numbers
assigned to ordinal data is to order (or rank) the observations from lowest to highest and,
hence, the term ordinal.
The Interval Scale The interval scale is a more sophisticated scale than the nominal
or ordinal in that with this scale not only is it possible to order measurements, but also the
distance between any two measurements is known. We know, say, that the difference between
a measurement of 20 and a measurement of 30 is equal to the difference between
measurements of 30 and 40. The ability to do this implies the use of a unit distance and
a zero point, both of which are arbitrary. The selected zero point is not necessarily a true zero
in that it does not have to indicate a total absence of the quantity being measured. Perhaps the
best example of an interval scale is provided by the way in which temperature is usually
measured (degrees Fahrenheit or Celsius). The unit of measurement is the degree, and the
point of comparison is the arbitrarily chosen “zero degrees,” which does not indicate a lack of
heat. The interval scale, unlike the nominal and ordinal scales, is a truly quantitative scale.
The Ratio Scale The highest level of measurement is the ratio scale. This scale is
characterized by the fact that equality of ratios as well as equality of intervals may be
determined. Fundamental to the ratio scale is a true zero point. The measurement of such
familiar traits as height, weight, and length makes use of the ratio scale.
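The difference between the interval and ratio scales can be checked with a little arithmetic. The following Python sketch uses made-up temperatures and weights (they are illustrative assumptions, not data from the text). It suggests that equality of intervals survives a change of temperature unit, whereas the ratio of two temperatures does not; the ratio of two weights, which are measured from a true zero, is the same in any unit.

```python
def celsius_to_fahrenheit(c):
    """Convert a temperature from degrees Celsius to degrees Fahrenheit."""
    return c * 9 / 5 + 32

# Hypothetical readings in degrees Celsius (interval scale).
t1_c, t2_c, t3_c = 20.0, 30.0, 40.0
t1_f, t2_f, t3_f = (celsius_to_fahrenheit(t) for t in (t1_c, t2_c, t3_c))

# Equal intervals remain equal after the change of unit.
equal_intervals_c = (t2_c - t1_c) == (t3_c - t2_c)
equal_intervals_f = (t2_f - t1_f) == (t3_f - t2_f)

# Ratios of temperatures depend on the arbitrary zero point.
ratio_c = t3_c / t1_c   # 40 / 20 in Celsius
ratio_f = t3_f / t1_f   # 104 / 68 in Fahrenheit, a different value

# Weight has a true zero, so a ratio does not depend on the unit.
KG_PER_LB = 0.45359237
w1_lb, w2_lb = 100.0, 200.0          # hypothetical weights in pounds
ratio_lb = w2_lb / w1_lb
ratio_kg = (w2_lb * KG_PER_LB) / (w1_lb * KG_PER_LB)
```

A statement such as "40 degrees is twice as hot as 20 degrees" therefore depends on the unit chosen, while "200 pounds is twice as heavy as 100 pounds" does not.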
1.4 SAMPLING AND
STATISTICAL INFERENCE
As noted earlier, one of the purposes of this book is to teach the concepts of statistical
inference, which we may define as follows:
DEFINITION
Statistical inference is the procedure by which we reach a conclusion
about a population on the basis of the information contained in a sample
that has been drawn from that population.
There are many kinds of samples that may be drawn from a population. Not every
kind of sample, however, can be used as a basis for making valid inferences about a
population. In general, in order to make a valid inference about a population, we need a
scientific sample from the population. There are also many kinds of scientific samples that
may be drawn from a population. The simplest of these is the simple random sample. In this
section we define a simple random sample and show you how to draw one from a
population.
If we use the letter N to designate the size of a finite population and the letter n to
designate the size of a sample, we may define a simple random sample as follows:
DEFINITION
If a sample of size n is drawn from a population of size N in such a way
that every possible sample of size n has the same chance of being selected,
the sample is called a simple random sample.
The mechanics of drawing a sample to satisfy the definition of a simple random
sample is called simple random sampling.
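To make the definition concrete, the following Python sketch lists every possible sample of size n = 2 from a tiny population of size N = 5 and counts them. The population of five numbered units is a hypothetical illustration, and the sketch assumes that the order of selection is ignored and that no unit may appear twice in a sample. Under simple random sampling each of the listed samples would have the same chance of being the one selected.

```python
from itertools import combinations
from math import comb

N, n = 5, 2                        # hypothetical population and sample sizes
population = list(range(1, N + 1))

# Every distinct subset of size n is one possible sample.
all_samples = list(combinations(population, n))

number_of_samples = comb(N, n)             # N! / (n! (N - n)!)
chance_of_each = 1 / number_of_samples     # identical for every sample

# For N = 189 and n = 10 the samples are far too numerous to list,
# but they can still be counted.
samples_189_10 = comb(189, 10)
```

The count grows very quickly with N and n, which is why a mechanical selection procedure is used in practice rather than an enumeration of all possible samples.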
We will demonstrate the procedure of simple random sampling shortly, but first let us
consider the problem of whether to sample with replacement or without replacement. When
sampling with replacement is employed, every member of the population is available at
each draw. For example, suppose that we are drawing a sample from a population of former
hospital patients as part of a study of length of stay. Let us assume that the sampling
involves selecting from the shelves in the medical records department a sample of charts of
discharged patients. In sampling with replacement we would proceed as follows: select a
chart to be in the sample, record the length of stay, and return the chart to the shelf. The
chart is back in the “population” and may be drawn again on some subsequent draw, in
which case the length of stay will again be recorded. In sampling without replacement, we
would not return a drawn chart to the shelf after recording the length of stay, but would lay
it aside until the entire sample is drawn. Following this procedure, a given chart could
appear in the sample only once. As a rule, in practice, sampling is always done without
replacement. The significance and consequences of this will be explained later, but first let
us see how one goes about selecting a simple random sample. To ensure true randomness of
selection, we will need to follow some objective procedure. We certainly will want to avoid
using our own judgment to decide which members of the population constitute a random
sample. The following example illustrates one method of selecting a simple random sample
from a population.
EXAMPLE 1.4.1
Gold et al. (A-1) studied the effectiveness on smoking cessation of bupropion SR, a
nicotine patch, or both, when co-administered with cognitive-behavioral therapy. Consecutive
consenting patients assigned themselves to one of the three treatments. For illustrative
purposes, let us consider all these subjects to be a population of size N = 189. We wish to
select a simple random sample of size 10 from this population whose ages are shown in
Table 1.4.1.
TABLE 1.4.1 Ages of 189 Subjects Who Participated in a Study on Smoking
Cessation
Subject No. Age Subject No. Age Subject No. Age Subject No. Age
1 48 49 38 97 51 145 52
2 35 50 44 98 50 146 53
3 46 51 43 99 50 147 61
4 44 52 47 100 55 148 60
5 43 53 46 101 63 149 53
6 42 54 57 102 50 150 53
7 39 55 52 103 59 151 50
8 44 56 54 104 54 152 53
9 49 57 56 105 60 153 54
10 49 58 53 106 50 154 61
11 44 59 64 107 56 155 61
12 39 60 53 108 68 156 61
13 38 61 58 109 66 157 64
14 49 62 54 110 71 158 53
15 49 63 59 111 82 159 53
16 53 64 56 112 68 160 54
17 56 65 62 113 78 161 61
18 57 66 50 114 66 162 60
19 51 67 64 115 70 163 51
20 61 68 53 116 66 164 50
21 53 69 61 117 78 165 53
22 66 70 53 118 69 166 64
23 71 71 62 119 71 167 64
24 75 72 57 120 69 168 53
25 72 73 52 121 78 169 60
26 65 74 54 122 66 170 54
27 67 75 61 123 68 171 55
28 38 76 59 124 71 172 58
29 37 77 57 125 69 173 62
30 46 78 52 126 77 174 62
31 44 79 54 127 76 175 54
32 44 80 53 128 71 176 53
33 48 81 62 129 43 177 61
34 49 82 52 130 47 178 54
35 30 83 62 131 48 179 51
36 45 84 57 132 37 180 62
37 47 85 59 133 40 181 57
38 45 86 59 134 42 182 50
39 48 87 56 135 38 183 64
40 47 88 57 136 49 184 63
41 47 89 53 137 43 185 65
42 44 90 59 138 46 186 71
43 48 91 61 139 34 187 71
44 43 92 55 140 46 188 73
45 45 93 61 141 46 189 66
46 40 94 56 142 48
47 48 95 52 143 47
48 49 96 54 144 43
Source: Data provided courtesy of Paul B. Gold, Ph.D.
Solution: One way of selecting a simple random sample is to use a table of random
numbers like that shown in the Appendix, Table A. As the first step, we locate
a random starting point in the table. This can be done in a number of ways,
one of which is to look away from the page while touching it with the point of
a pencil. The random starting point is the digit closest to where the pencil
touched the page. Let us assume that following this procedure led to a random
starting point in Table A at the intersection of row 21 and column 28. The
digit at this point is 5. Since we have 189 values to choose from, we can use
only the random numbers 1 through 189. It will be convenient to pick three-digit
numbers so that the numbers 001 through 189 will be the only eligible
numbers. The first three-digit number, beginning at our random starting point,
is 532, a number we cannot use. The next number (going down) is 196, which
again we cannot use. Let us move down past 196, 372, 654, and 928 until we
come to 137, a number we can use. The age of the 137th subject from Table
1.4.1 is 43, the first value in our sample. We record the random number and
the corresponding age in Table 1.4.2. We record the random number to keep
track of the random numbers selected. Since we want to sample without
replacement, we do not want to include the same individual’s age twice.
Proceeding in the manner just described leads us to the remaining nine
random numbers and their corresponding ages shown in Table 1.4.2. Notice
that when we get to the end of the column, we simply move over three digits
to 028 and proceed up the column. We could have started at the top with the
number 369.
Thus we have drawn a simple random sample of size 10 from a
population of size 189. In future discussions, whenever the term simple
random sample is used, it will be understood that the sample has been drawn
in this or an equivalent manner.
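The selection rule used in Example 1.4.1 can be sketched in Python as follows. The function name and its arguments are hypothetical; the sketch simply scans a sequence of three-digit numbers, skips any number outside 1 through N, and, when sampling without replacement, skips any number already selected. The second part replaces the printed table with computer-generated digits, which is an assumption for illustration and not the procedure of the example itself.

```python
import random

def select_from_random_numbers(candidates, population_size, sample_size,
                               with_replacement=False):
    """Scan random numbers in order and keep the usable ones.

    candidates: iterable of integers read from a random number table.
    Numbers outside 1..population_size are skipped. When sampling without
    replacement, a number that was already selected is skipped as well.
    """
    selected = []
    for number in candidates:
        if not 1 <= number <= population_size:
            continue                  # e.g. 532 or 196 when N = 189
        if not with_replacement and number in selected:
            continue                  # the same subject would appear twice
        selected.append(number)
        if len(selected) == sample_size:
            break
    return selected

# The first numbers read in Example 1.4.1; only 137 is eligible.
start_of_scan = [532, 196, 372, 654, 928, 137]
first_pick = select_from_random_numbers(start_of_scan, 189, 1)

# Computer-generated three-digit numbers standing in for the printed table.
rng = random.Random(1)                # arbitrary seed, for repeatability
stream = (rng.randrange(1000) for _ in range(10_000))
sample_ids = select_from_random_numbers(stream, 189, 10)
```

Setting with_replacement=True would allow the same subject number to be recorded more than once, which corresponds to returning the chart to the shelf in the earlier discussion.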
The preceding discussion of random sampling is presented because of the important
role that the sampling process plays in designing research studies and experiments. The
methodology and concepts employed in sampling processes will be described in more
detail in Section 1.5.
DEFINITION
A research study is a scientific study of a phenomenon of interest.
Research studies involve designing sampling protocols, collecting and
analyzing data, and providing valid conclusions based on the results of
the analyses.
DEFINITION
Experiments are a special type of research study in which observations
are made after specific manipulations of conditions have been carried
out; they provide the foundation for scientific research.
TABLE 1.4.2 Sample of 10 Ages Drawn from the Ages in Table 1.4.1
Random Number Sample Subject Number Age
137 1 43
114 2 66
155 3 61
183 4 64
185 5 65
028 6 38
085 7 59
181 8 57
018 9 57
164 10 50
Despite the tremendous importance of random sampling in the design of research
studies and experiments, there are some occasions when random sampling may not be the
most appropriate method to use. Consequently, other sampling methods must be considered.
The intention here is not to provide a comprehensive review of sampling methods, but
rather to acquaint the student with two additional sampling methods that are employed in
the health sciences, systematic sampling and stratified random sampling. Interested readers
are referred to the books by Thompson (3) and Levy and Lemeshow (4) for detailed
overviews of various sampling methods and explanations of how sample statistics are
calculated when these methods are applied in research studies and experiments.
Systematic Sampling A sampling method that is widely used in healthcare
research is the systematic sample. Medical records, which contain raw data used in
healthcare research, are generally stored in a file system or on a computer and hence are
easy to select in a systematic way. Using systematic sampling methodology, a researcher
calculates the total number of records needed for the study or experiment at hand. A
random numbers table is then employed to select a starting point in the file system. The
record located at this starting point is called record x. A second number, determined by the
number of records desired, is selected to define the sampling interval (call this interval k).
Consequently, the data set would consist of records x, x + k, x + 2k, x + 3k, and so on, until
the necessary number of records is obtained.
EXAMPLE 1.4.2
Continuing with the study of Gold et al. (A-1) illustrated in the previous example, imagine
that we wanted a systematic sample of 10 subjects from those listed in Table 1.4.1.
Solution: To obtain a starting point, we will again use Appendix Table A. For purposes
of illustration, let us assume that the random starting point in Table A was the
intersection of row 10 and column 30. The digit is a 4 and will serve as our
starting point, x. Since we are starting at subject 4, this leaves 185 remaining
subjects (i.e., 189 - 4) from which to choose. Since we wish to select 10
subjects, one method to define the sample interval, k, would be to take
185/10 = 18.5. To ensure that there will be enough subjects, it is customary to
round this quotient down, and hence we will round the result to 18. The
resulting sample is shown in Table 1.4.3.
TABLE 1.4.3 Sample of 10 Ages Selected Using a
Systematic Sample from the Ages in Table 1.4.1
Systematically Selected Subject Number Age
4 44
22 66
40 47
58 53
76 59
94 56
112 68
130 47
148 60
166 64
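The arithmetic of Example 1.4.2 can be reproduced with a short Python sketch. The function name is hypothetical, and the interval is computed as in the example: the number of subjects remaining after the starting point, divided by the desired sample size and rounded down.

```python
def systematic_sample(population_size, sample_size, start):
    """Return the subject numbers start, start + k, start + 2k, ...

    The interval k is (population_size - start) / sample_size,
    rounded down, following Example 1.4.2.
    """
    k = (population_size - start) // sample_size
    if k < 1:
        raise ValueError("not enough subjects after the starting point")
    return [start + i * k for i in range(sample_size)]

# N = 189 subjects, a sample of 10, and a random starting point of 4.
subjects = systematic_sample(population_size=189, sample_size=10, start=4)
```

With these inputs the interval is 18, and the subject numbers returned are those listed in the first column of Table 1.4.3.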
Stratified Random Sampling A common situation that may be encountered
in a population under study is one in which the sample units occur together in a grouped
fashion. On occasion, when the sample units are not inherently grouped, it may be possible
and desirable to group them for sampling purposes. In other words, it may be desirable to
partition a population of interest into groups, or strata, in which the sample units within a
particular stratum are more similar to each other than they are to the sample units that
compose the other strata. After the population is stratified, it is customary to take a random
sample independently from each stratum. This technique is called stratified random
sampling. The resulting sample is called a stratified random sample. Although the benefits
of stratified random sampling may not be readily observable, it is most often the case that
random samples taken within a stratum will have much less variability than a random
sample taken across all strata. This is true because sample units within each stratum tend to
have characteristics that are similar.
EXAMPLE 1.4.3
Hospital trauma centers are given ratings depending on their capabilities to treat various
traumas. In this system, a level 1 trauma center is the highest level of available trauma care
and a level 4 trauma center is the lowest level of available trauma care. Imagine that we are
interested in estimating the survival rate of trauma victims treated at hospitals within a
large metropolitan area. Suppose that the metropolitan area has a level 1, a level 2, and a
level 3 trauma center. We wish to take samples of patients from these trauma centers in such
a way that the total sample size is 30.
Solution: We assume that the survival rates of patients may depend quite significantly
on the trauma that they experienced and therefore on the level of care that
they receive. As a result, a simple random sample of all trauma patients,
without regard to the center at which they were treated, may not represent
true survival rates, since patients receive different care at the various trauma
centers. One way to better estimate the survival rate is to treat each trauma
center as a stratum and then randomly select 10 patient files from each of the
three centers. This procedure is based on the fact that we suspect that the
survival rates within the trauma centers are less variable than the survival
rates across trauma centers. Therefore, we believe that the stratified random
sample provides a better representation of survival than would a sample taken
without regard to differences within strata.
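The selection described in Example 1.4.3 might be carried out as in the following Python sketch. The patient file labels and the number of files at each center are invented for illustration; only the design (three strata, 10 files drawn at random from each, for a total of 30) comes from the example.

```python
import random

def stratified_random_sample(strata, per_stratum, rng):
    """Draw a simple random sample of per_stratum units from each stratum.

    strata: dict mapping a stratum label to the list of its sample units.
    """
    return {label: rng.sample(units, per_stratum)
            for label, units in strata.items()}

rng = random.Random(3)    # arbitrary seed, for repeatability
# Hypothetical patient file labels; the counts are assumptions.
strata = {
    "level 1": [f"L1-{i:03d}" for i in range(1, 101)],
    "level 2": [f"L2-{i:03d}" for i in range(1, 61)],
    "level 3": [f"L3-{i:03d}" for i in range(1, 41)],
}
sample = stratified_random_sample(strata, per_stratum=10, rng=rng)
total_sampled = sum(len(files) for files in sample.values())
```

Because each stratum is sampled independently, no file from one center can take the place of a file from another, whatever the seed.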
It should be noted that two slight modifications of the stratified sampling technique
are frequently employed. To illustrate, consider again the trauma center example. In the
first place, a systematic sample of patient files could have been selected from each trauma
center (stratum). Such a sample is called a stratified systematic sample.
The second modification of stratified sampling involves selecting the sample from a
given stratum in such a way that the number of sample units selected from that stratum is
proportional to the size of the population of that stratum. Suppose, in our trauma center
example that the level 1 trauma center treated 100 patients and the level 2 and level 3
trauma centers treated only 10 each. In that case, selecting a random sample of 10 from
each trauma center overrepresents the trauma centers with smaller patient loads. To avoid
this problem, we adjust the size of the sample taken from a stratum so that it is proportional
to the size of the stratum’s population. This type of sampling is called stratified sampling
proportional to size. The within-stratum samples can be either random or systematic as
described above.
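A proportional allocation can be computed as in the Python sketch below. The text does not say how fractional sample sizes should be handled, so the rounding rule used here (give each stratum the whole-number part of its share, then hand any remaining units to the strata with the largest fractional parts, with ties going to the stratum listed first) is an assumption made only for illustration.

```python
def proportional_allocation(stratum_sizes, total_sample):
    """Split total_sample across strata in proportion to stratum size.

    Fractional shares are resolved with a largest-remainder rule; ties
    go to the stratum listed first. Both choices are assumptions.
    """
    population = sum(stratum_sizes)
    # (whole part, remainder) of total_sample * size / population
    shares = [divmod(total_sample * size, population)
              for size in stratum_sizes]
    allocation = [whole for whole, _ in shares]
    leftover = total_sample - sum(allocation)
    # Hand the remaining units to the largest fractional parts.
    by_remainder = sorted(range(len(shares)), key=lambda i: -shares[i][1])
    for i in by_remainder[:leftover]:
        allocation[i] += 1
    return allocation

# Stratum sizes from the text: 100, 10, and 10 patients.
allocation_30 = proportional_allocation([100, 10, 10], 30)
allocation_12 = proportional_allocation([100, 10, 10], 12)
```

With a total sample of 30, the exact shares are 25, 2.5, and 2.5, so some rounding is unavoidable; with a total sample of 12, the shares are the whole numbers 10, 1, and 1.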
EXERCISES
1.4.1 Using the table of random numbers, select a new random starting point, and draw another simple
random sample of size 10 from the data in Table 1.4.1. Record the ages of the subjects in this new
sample. Save your data for future use. What is the variable of interest in this exercise? What
measurement scale was used to obtain the measurements?
1.4.2 Select another simple random sample of size 10 from the population represented in Table 1.4.1.
Compare the subjects in this sample with those in the sample drawn in Exercise 1.4.1. Are there any
subjects who showed up in both samples? How many? Compare the ages of the subjects in the two
samples. How many ages in the first sample were duplicated in the second sample?
1.4.3 Using the table of random numbers, select a random sample and a systematic sample, each of size 15,
from the data in Table 1.4.1. Visually compare the distributions of the two samples. Do they appear
similar? Which appears to be the best representation of the data?
1.4.4 Construct an example where it would be appropriate to use stratified sampling. Discuss how you
would use stratified random sampling and stratified sampling proportional to size with this example.
Which do you think would best represent the population that you described in your example? Why?
1.5 THE SCIENTIFIC METHOD
AND THE DESIGN OF EXPERIMENTS
Data analyses using a broad range of statistical methods play a significant role in scientific
studies. The previous section highlighted the importance of obtaining samples in a
scientific manner. Appropriate sampling techniques enhance the likelihood that the results
of statistical analyses of a data set will provide valid and scientifically defensible results.
Because of the importance of the proper collection of data to support scientific discovery, it
is necessary to consider the foundation of such discovery—the scientific method—and to
explore the role of statistics in the context of this method.
DEFINITION
The scientific method is a process by which scientific information is
collected, analyzed, and reported in order to produce unbiased and
replicable results in an effort to provide an accurate representation of
observable phenomena.
The scientific method is recognized universally as the only truly acceptable way to
produce new scientific understanding of the world around us. It is based on an empirical
approach, in that decisions and outcomes are based on data. There are several key elements
associated with the scientific method, and the concepts and techniques of statistics play a
prominent role in all these elements.
Making an Observation First, an observation is made of a phenomenon or a
group of phenomena. This observation leads to the formulation of questions or uncertainties
that can be answered in a scientifically rigorous way. For example, it is readily
observable that regular exercise reduces body weight in many people. It is also readily
observable that changing diet may have a similar effect. In this case there are two
observable phenomena, regular exercise and diet change, that have the same endpoint.
The nature of this endpoint can be determined by use of the scientific method.
Formulating a Hypothesis In the second step of the scientific method a
hypothesis is formulated to explain the observation and to make quantitative predictions
of new observations. Often hypotheses are generated as a result of extensive background
research and literature reviews. The objective is to produce hypotheses that are scientifically
sound. Hypotheses may be stated as either research hypotheses or statistical
hypotheses. Explicit definitions of these terms are given in Chapter 7, which discusses
the science of testing hypotheses. Suffice it to say for now that a research hypothesis from
the weight-loss example would be a statement such as, “Exercise appears to reduce body
weight.” There is certainly nothing incorrect about this conjecture, but it lacks a truly
quantitative basis for testing. A statistical hypothesis may be stated using quantitative
terminology as follows: “The average (mean) loss of body weight of people who exercise is
greater than the average (mean) loss of body weight of people who do not exercise.” In this
statement a quantitative measure, the “average” or “mean” value, is hypothesized to be
greater in the sample of patients who exercise. The role of the statistician in this step of the
scientific method is to state the hypothesis in a way that valid conclusions may be drawn
and to interpret correctly the results of such conclusions.
Designing an Experiment The third step of the scientific method involves
designing an experiment that will yield the data necessary to validly test an appropriate
statistical hypothesis. This step of the scientific method, like that of data analysis, requires
the expertise of a statistician. Improperly designed experiments are the leading cause of
invalid results and unjustified conclusions. Further, most studies that are challenged by
experts are challenged on the basis of the appropriateness or inappropriateness of the
study’s research design.
Those who properly design research experiments make every effort to ensure that the
measurement of the phenomenon of interest is both accurate and precise. Accuracy refers
to the correctness of a measurement. Precision, on the other hand, refers to the consistency
of a measurement. It should be noted that in the social sciences, the term validity is
sometimes used to mean accuracy and that reliability is sometimes used to mean precision.
In the context of the weight-loss example given earlier, the scale used to measure the weight
of study participants would be accurate if the measurement is validated using a scale that is
properly calibrated. If, however, the scale is off by +3 pounds, then each participant’s
weight would be 3 pounds heavier; the measurements would be precise in that each would
be wrong by +3 pounds, but the measurements would not be accurate. Measurements that
are inaccurate or imprecise may invalidate research findings.
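The scale example can be put into numbers with the Python sketch below. The readings are invented, and the two summary measures (the average error as an indicator of accuracy, and the standard deviation of repeated readings as an indicator of precision) are one possible choice assumed here for illustration.

```python
from statistics import mean, pstdev

true_weight = 150.0    # hypothetical true weight in pounds

# Scale A always reads 3 pounds heavy: consistent, but wrong.
scale_a = [153.0, 153.0, 153.0, 153.0, 153.0]
# Scale B is correct on average, but its readings scatter.
scale_b = [147.0, 152.0, 150.0, 149.0, 152.0]

def bias(readings, truth):
    """Average error: near zero for an accurate instrument."""
    return mean(readings) - truth

def spread(readings):
    """Standard deviation of readings: near zero for a precise instrument."""
    return pstdev(readings)

bias_a, spread_a = bias(scale_a, true_weight), spread(scale_a)
bias_b, spread_b = bias(scale_b, true_weight), spread(scale_b)
```

In this sketch scale A is precise but not accurate (no spread, an average error of 3 pounds), whereas scale B is accurate on average but less precise.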
The design of an experiment depends on the type of data that need to be collected to
test a specific hypothesis. As discussed in Section 1.2, data may be collected or made
available through a variety of means. For much scientific research, however, the standard
for data collection is experimentation. A true experimental design is one in which study
subjects are randomly assigned to an experimental group (or treatment group) and a control
group that is not directly exposed to a treatment. Continuing the weight-loss example, a
sample of 100 participants could be randomly assigned to two conditions using the
methods of Section 1.4. A sample of 50 of the participants would be assigned to a specific
exercise program and the remaining 50 would be monitored, but asked not to exercise for a
specific period of time. At the end of this experiment the average (mean) weight losses of
the two groups could be compared. The reason that experimental designs are desirable
is that if all other potential factors are controlled, a cause–effect relationship may be tested;
that is, all else being equal, we would be able to conclude or fail to conclude that the
experimental group lost weight as a result of exercising.
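The random assignment in the weight-loss example could be carried out as in this Python sketch. The participant ID numbers are hypothetical, and shuffling the list and splitting it in half is only one of several equivalent ways to assign 50 participants to each condition at random.

```python
import random

def randomly_assign(participants, rng):
    """Shuffle the participants and split them into two equal-sized groups."""
    shuffled = list(participants)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

rng = random.Random(7)                  # arbitrary seed, for repeatability
participants = list(range(1, 101))      # hypothetical ID numbers 1..100
exercise_group, control_group = randomly_assign(participants, rng)
```

Every participant lands in exactly one group, and neither the investigator nor the participant chooses which.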
The potential complexity of research designs requires statistical expertise, and
Chapter 8 highlights some commonly used experimental designs. For a more in-depth
discussion of research designs, the interested reader may wish to refer to texts by Kuehl (5),
Keppel and Wickens (6), and Tabachnick and Fidell (7).
Conclusion In the execution of a research study or experiment, one would hope to
have collected the data necessary to draw conclusions, with some degree of confidence,
about the hypotheses that were posed as part of the design. It is often the case that
hypotheses need to be modified and retested with new data and a different design.
Whatever the conclusions of the scientific process, however, results are rarely considered
to be conclusive. That is, results need to be replicated, often a large number of times, before
scientific credence is granted them.
EXERCISES
1.5.1 Using the example of weight loss as an endpoint, discuss how you would use the scientific method to
test the observation that change in diet is related to weight loss. Include all of the steps, including the
hypothesis to be tested and the design of your experiment.
1.5.2 Continuing with Exercise 1.5.1, consider how you would use the scientific method to test the
observation that both exercise and change in diet are related to weight loss. Include all of the steps,
paying particular attention to how you might design the experiment and which hypotheses would be
testable given your design.
1.6 COMPUTERS AND
BIOSTATISTICAL ANALYSIS
The widespread use of computers has had a tremendous impact on health sciences research
in general and biostatistical analysis in particular. The necessity to perform long and
tedious arithmetic computations as part of the statistical analysis of data lives only in the
memory of those researchers and practitioners whose careers antedate the so-called
computer revolution. Computers can perform more calculations faster and far more
accurately than can human technicians. The use of computers makes it possible for
investigators to devote more time to the improvement of the quality of raw data and the
interpretation of the results.
The current prevalence of microcomputers and the abundance of available statistical
software programs have further revolutionized statistical computing. The reader in search
of a statistical software package may wish to consult The American Statistician, a quarterly
publication of the American Statistical Association. Statistical software packages are
regularly reviewed and advertised in the periodical.
Computers currently on the market are equipped with random number generating
capabilities. As an alternative to using printed tables of random numbers, investigators may
use computers to generate the random numbers they need. Actually, the “random” numbers
generated by most computers are in reality pseudorandom numbers because they are the
result of a deterministic formula. However, as Fishman (8) points out, the numbers appear
to serve satisfactorily for many practical purposes.
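The deterministic nature of pseudorandom numbers can be seen in a short Python sketch. The first part implements a linear congruential generator, one simple kind of deterministic formula; the constants are an arbitrary illustrative choice and are not taken from the text or from any particular software package. The second part shows the same behavior with Python's built-in generator.

```python
import random

def lcg(seed, count, a=1103515245, c=12345, m=2**31):
    """Linear congruential generator: x_next = (a * x + c) mod m.

    The constants a, c, and m are illustrative assumptions.
    """
    values = []
    x = seed
    for _ in range(count):
        x = (a * x + c) % m
        values.append(x)
    return values

# The same seed always reproduces the same "random" sequence.
first_run = lcg(seed=2012, count=5)
second_run = lcg(seed=2012, count=5)

# Python's built-in generator behaves the same way when seeded.
builtin_a = random.Random(2012).sample(range(1, 190), 10)
builtin_b = random.Random(2012).sample(range(1, 190), 10)
```

Recording the seed makes it possible for another investigator to reproduce a computer-drawn sample exactly, much as recording the random numbers did in Table 1.4.2.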
The usefulness of the computer in the health sciences is not limited to statistical
analysis. The reader interested in learning more about the use of computers in the health
sciences will find the books by Hersh (4), Johns (5), Miller et al. (6), and Saba and
McCormick (7) helpful. Those who wish to derive maximum benefit from the Internet may
wish to consult the books Physicians’ Guide to the Internet (13) and Computers in
Nursing’s Nurses’ Guide to the Internet (14). Current developments in the use of computers
in biology, medicine, and related fields are reported in several periodicals devoted to
the subject. A few such periodicals are Computers in Biology and Medicine, Computers
and Biomedical Research, International Journal of Bio-Medical Computing, Computer
Methods and Programs in Biomedicine, Computer Applications in the Biosciences, and
Computers in Nursing.
Computer printouts are used throughout this book to illustrate the use of computers in
biostatistical analysis. The MINITAB, SPSS, R, and SAS® statistical software packages for
the personal computer have been used for this purpose.
1.7 SUMMARY
In this chapter we introduced the reader to the basic concepts of statistics. We defined
statistics as an area of study concerned with collecting and describing data and with making
statistical inferences. We defined statistical inference as the procedure by which we reach a
conclusion about a population on the basis of information contained in a sample drawn
from that population. We learned that a basic type of sample that will allow us to make valid
inferences is the simple random sample. We learned how to use a table of random numbers
to draw a simple random sample from a population.
The reader is provided with the definitions of some basic terms, such as variable
and sample, that are used in the study of statistics. We also discussed measurement and
defined four measurement scales—nominal, ordinal, interval, and ratio. The reader is
also introduced to the scientific method and the role of statistics and the statistician in
this process.
Finally, we discussed the importance of computers in the performance of the
activities involved in statistics.
REVIEW QUESTIONS AND EXERCISES
1. Explain what is meant by descriptive statistics.
2. Explain what is meant by inferential statistics.
3. Define:
(a) Statistics (b) Biostatistics
(c) Variable (d) Quantitative variable
(e) Qualitative variable (f) Random variable
(g) Population (h) Finite population
(i) Infinite population (j) Sample
(k) Discrete variable (l) Continuous variable
(m) Simple random sample (n) Sampling with replacement
(o) Sampling without replacement
4. Define the word measurement.
5. List, describe, and compare the four measurement scales.
6. For each of the following variables, indicate whether it is quantitative or qualitative and specify the
measurement scale that is employed when taking measurements on each:
(a) Class standing of the members of this class relative to each other
(b) Admitting diagnosis of patients admitted to a mental health clinic
(c) Weights of babies born in a hospital during a year
(d) Gender of babies born in a hospital during a year
(e) Range of motion of elbow joint of students enrolled in a university health sciences curriculum
(f) Under-arm temperature of day-old infants born in a hospital
7. For each of the following situations, answer questions a through e:
(a) What is the sample in the study?
(b) What is the population?
(c) What is the variable of interest?
(d) How many measurements were used in calculating the reported results?
(e) What measurement scale was used?
Situation A. A study of 300 households in a small southern town revealed that 20 percent had at least
one school-age child present.
Situation B. A study of 250 patients admitted to a hospital during the past year revealed that, on the
average, the patients lived 15 miles from the hospital.
8. Consider the two situations given in Exercise 7. For Situation A describe how you would use a
stratified random sample to collect the data. For Situation B describe how you would use systematic
sampling of patient records to collect the data.
REFERENCES
Methodology References
1. S. S. STEVENS, “On the Theory of Scales of Measurement,” Science, 103 (1946), 677–680.
2. S. S. STEVENS, “Mathematics, Measurement and Psychophysics,” in S. S. Stevens (ed.), Handbook of Experimental
Psychology, Wiley, New York, 1951.
3. STEVEN K. THOMPSON, Sampling (2nd ed.), Wiley, New York, 2002.
4. PAUL S. LEVY and STANLEY LEMESHOW, Sampling of Populations: Methods and Applications (3rd ed.), Wiley,
New York, 1999.
5. ROBERT O. KUEHL, Statistical Principles of Research Design and Analysis (2nd ed.), Duxbury Press, Belmont, CA,
1999.
6. GEOFFREY KEPPEL and THOMAS D. WICKENS, Design and Analysis: A Researcher’s Handbook (4th ed.), Prentice
Hall, Upper Saddle River, NJ, 2004.
7. BARBARA G. TABACHNICK and LINDA S. FIDELL, Experimental Designs using ANOVA, Thomson, Belmont, CA, 2007.
8. GEORGE S. FISHMAN, Concepts and Methods in Discrete Event Digital Simulation, Wiley, New York, 1973.
9. WILLIAM R. HERSH, Information Retrieval: A Health Care Perspective, Springer, New York, 1996.
10. MERIDA L. JOHNS, Information Management for Health Professions, Delmar Publishers, Albany, NY, 1997.
11. MARVIN J. MILLER, KENRIC W. HAMMOND, and MATTHEW G. HILE (eds.), Mental Health Computing, Springer,
New York, 1996.
12. VIRGINIA K. SABA and KATHLEEN A. MCCORMICK, Essentials of Computers for Nurses, McGraw-Hill, New York,
1996.
13. LEE HANCOCK, Physicians’ Guide to the Internet, Lippincott Williams & Wilkins Publishers, Philadelphia, 1996.
14. LESLIE H. NICOLL and TEENA H. OUELLETTE, Computers in Nursing’s Nurses’ Guide to the Internet, 3rd ed.,
Lippincott Williams & Wilkins Publishers, Philadelphia, 2001.
Applications References
A-1. PAUL B. GOLD, ROBERT N. RUBEY, and RICHARD T. HARVEY, “Naturalistic, Self-Assignment Comparative Trial
of Bupropion SR, a Nicotine Patch, or Both for Smoking Cessation Treatment in Primary Care,” American Journal
on Addictions, 11 (2002), 315–331.
CHAPTER 2
DESCRIPTIVE STATISTICS
CHAPTER OVERVIEW
This chapter introduces a set of basic procedures and statistical measures for
describing data. Data generally consist of an extensive number of measurements
or observations that are too numerous or complicated to be understood
through simple observation. Therefore, this chapter introduces several techniques,
including the construction of tables, graphical displays, and basic
statistical computations, that provide ways to condense and organize information
into a set of descriptive measures and visual devices that enhance the
understanding of complex data.
TOPICS
2.1 INTRODUCTION
2.2 THE ORDERED ARRAY
2.3 GROUPED DATA: THE FREQUENCY DISTRIBUTION
2.4 DESCRIPTIVE STATISTICS: MEASURES OF CENTRAL TENDENCY
2.5 DESCRIPTIVE STATISTICS: MEASURES OF DISPERSION
2.6 SUMMARY
LEARNING OUTCOMES
After studying this chapter, the student will
1. understand how data can be appropriately organized and displayed.
2. understand how to reduce data sets into a few useful, descriptive measures.
3. be able to calculate and interpret measures of central tendency, such as the mean,
median, and mode.
4. be able to calculate and interpret measures of dispersion, such as the range,
variance, and standard deviation.
2.1 INTRODUCTION
In Chapter 1 we stated that the taking of a measurement and the process of counting yield
numbers that contain information. The objective of the person applying the tools of
statistics to these numbers is to determine the nature of this information. This task is made
much easier if the numbers are organized and summarized. When measurements of a
random variable are taken on the entities of a population or sample, the resulting values are
made available to the researcher or statistician as a mass of unordered data. Measurements
that have not been organized, summarized, or otherwise manipulated are called raw data.
Unless the number of observations is extremely small, it will be unlikely that these raw data
will impart much information until they have been put into some kind of order.
In this chapter we learn several techniques for organizing and summarizing data so
that we may more easily determine what information they contain. The ultimate in
summarization of data is the calculation of a single number that in some way conveys
important information about the data from which it was calculated. Such single numbers
that are used to describe data are called descriptive measures. After studying this chapter
you will be able to compute several descriptive measures for both populations and samples
of data.
The purpose of this chapter is to equip you with skills that will enable you to
manipulate the information—in the form of numbers—that you encounter as a health
sciences professional. The better able you are to manipulate such information, the better
understanding you will have of the environment and forces that generate the information.
2.2 THE ORDERED ARRAY
A first step in organizing data is the preparation of an ordered array. An ordered array is a
listing of the values of a collection (either population or sample) in order of magnitude from
the smallest value to the largest value. If the number of measurements to be ordered is of
any appreciable size, the use of a computer to prepare the ordered array is highly desirable.
An ordered array enables one to determine quickly the value of the smallest
measurement, the value of the largest measurement, and other facts about the arrayed
data that might be needed in a hurry. We illustrate the construction of an ordered array with
the data discussed in Example 1.4.1.
EXAMPLE 2.2.1
Table 1.4.1 contains a list of the ages of subjects who participated in the study on smoking
cessation discussed in Example 1.4.1. As can be seen, this unordered table requires
considerable searching for us to ascertain such elementary information as the age of the
youngest and oldest subjects.
Solution: Table 2.2.1 presents the data of Table 1.4.1 in the form of an ordered array. By
referring to Table 2.2.1 we are able to determine quickly the age of the
youngest subject (30) and the age of the oldest subject (82). We also readily
note that about one-third of the subjects are 50 years of age or younger.
Computer Analysis If additional computations and organization of a data set
have to be done by hand, the work may be facilitated by working from an ordered array. If
the data are to be analyzed by a computer, it may be undesirable to prepare an ordered array,
unless one is needed for reference purposes or for some other use. A computer does not
need for its user to first construct an ordered array before entering data for the construction
of frequency distributions and the performance of other analyses. However, almost all
computer statistical packages and spreadsheet programs contain a routine for sorting data
in either an ascending or descending order. See Figure 2.2.1, for example.
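As a minimal sketch of such a sorting routine outside MINITAB, the Python lines below build an ordered array with the built-in `sorted` function. The short list of ages is hypothetical and is not the actual ordering of Table 1.4.1; it only serves to show the idea.

```python
# Hypothetical, unordered ages (illustration only).
ages = [48, 35, 46, 82, 44, 30, 43, 53, 61]

ordered_array = sorted(ages)               # smallest to largest
descending = sorted(ages, reverse=True)    # largest to smallest

# The extremes can be read directly from the ends of the ordered array.
youngest, oldest = ordered_array[0], ordered_array[-1]
print(ordered_array, youngest, oldest)
```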
TABLE 2.2.1 Ordered Array of Ages of Subjects from Table 1.4.1
30 34 35 37 37 38 38 38 38 39 39 40 40 42 42
43 43 43 43 43 43 44 44 44 44 44 44 44 45 45
45 46 46 46 46 46 46 47 47 47 47 47 47 48 48
48 48 48 48 48 49 49 49 49 49 49 49 50 50 50
50 50 50 50 50 51 51 51 51 52 52 52 52 52 52
53 53 53 53 53 53 53 53 53 53 53 53 53 53 53
53 53 54 54 54 54 54 54 54 54 54 54 54 55 55
55 56 56 56 56 56 56 57 57 57 57 57 57 57 58
58 59 59 59 59 59 59 60 60 60 60 61 61 61 61
61 61 61 61 61 61 61 62 62 62 62 62 62 62 63
63 64 64 64 64 64 64 65 65 66 66 66 66 66 66
67 68 68 68 69 69 69 70 71 71 71 71 71 71 71
72 73 75 76 77 78 78 78 82
Dialog box:
Data > Sort
Session command:
MTB > Sort C1 C2;
SUBC> By C1.
FIGURE 2.2.1 MINITAB dialog box for Example 2.2.1.
2.3 GROUPED DATA: THE FREQUENCY DISTRIBUTION
Although a set of observations can be made more comprehensible and meaningful by
means of an ordered array, further useful summarization may be achieved by grouping the
data. Before the days of computers one of the main objectives in grouping large data sets
was to facilitate the calculation of various descriptive measures such as percentages and
averages. Because computers can perform these calculations on large data sets without first
grouping the data, the main purpose in grouping data now is summarization. One must bear
in mind that data contain information and that summarization is a way of making it easier to
determine the nature of this information. One must also be aware that reducing a large
quantity of information in order to summarize the data succinctly carries with it the
potential to inadvertently lose some amount of specificity with regard to the underlying
data set. Therefore, it is important to group the data sufficiently such that the vast amounts
of information are reduced into understandable summaries. At the same time, data should
not be summarized to the extent that useful intricacies in the data are no longer readily apparent.
To group a set of observations we select a set of contiguous, nonoverlapping intervals
such that each value in the set of observations can be placed in one, and only one, of the
intervals. These intervals are usually referred to as class intervals.
One of the first considerations when data are to be grouped is how many intervals to
include. Too few intervals are undesirable because of the resulting loss of information. On
the other hand, if too many intervals are used, the objective of summarization will not be
met. The best guide to this, as well as to other decisions to be made in grouping data, is your
knowledge of the data. It may be that class intervals have been determined by precedent, as
in the case of annual tabulations, when the class intervals of previous years are maintained
for comparative purposes. A commonly followed rule of thumb states that there should be
no fewer than five intervals and no more than 15. If there are fewer than five intervals, the
data have been summarized too much and the information they contain has been lost. If
there are more than 15 intervals, the data have not been summarized enough.
Those who need more specific guidance in the matter of deciding how many class
intervals to employ may use a formula given by Sturges (1). This formula gives
$k = 1 + 3.322(\log_{10} n)$, where k stands for the number of class intervals and n is the
number of values in the data set under consideration. The answer obtained by applying
Sturges’s rule should not be regarded as final, but should be considered as a guide only. The
number of class intervals specified by the rule should be increased or decreased for
convenience and clear presentation.
Suppose, for example, that we have a sample of 275 observations that we want to
group. The logarithm to the base 10 of 275 is 2.4393. Applying Sturges’s formula gives
$k = 1 + 3.322(2.4393) \approx 9$. In practice, other considerations might cause us to use eight
or fewer or perhaps 10 or more class intervals.
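The calculation above can be sketched in a few lines of Python. The function name `sturges_k` is an assumption made for this illustration; the formula itself is the one given in the text, and the result is meant as a guide only.

```python
import math

def sturges_k(n):
    """Suggested number of class intervals: k = 1 + 3.322 * log10(n)."""
    if n < 1:
        raise ValueError("n must be a positive count of observations")
    return 1 + 3.322 * math.log10(n)

k_275 = sturges_k(275)

# The unrounded value and the nearest whole number of intervals.
print(round(k_275, 2), round(k_275))
```

Because the rule is only a starting point, the rounded value would normally be adjusted up or down for convenience and clear presentation.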
Another question that must be decided regards the width of the class intervals. Class
intervals generally should be of the same width, although this is sometimes impossible to
accomplish. This width may be determined by dividing the range by k, the number of class
intervals. Symbolically, the class interval width is given by
$$w = \frac{R}{k} \qquad (2.3.1)$$
where R (the range) is the difference between the smallest and the largest observation in the
data set, and k is defined as above. As a rule this procedure yields a width that is
inconvenient for use. Again, we may exercise our good judgment and select a width
(usually close to one given by Equation 2.3.1) that is more convenient.
There are other rules of thumb that are helpful in setting up useful class intervals.
When the nature of the data makes them appropriate, class interval widths of 5 units, 10
units, and widths that are multiples of 10 tend to make the summarization more
comprehensible. When these widths are employed it is generally good practice to have
the lower limit of each interval end in a zero or 5. Usually class intervals are ordered from
smallest to largest; that is, the first class interval contains the smaller measurements and the
last class interval contains the larger measurements. When this is the case, the lower limit
of the first class interval should be equal to or smaller than the smallest measurement in the
data set, and the upper limit of the last class interval should be equal to or greater than the
largest measurement.
Most statistical packages allow users to interactively change the number of class
intervals and/or the class widths, so that several visualizations of the data can be obtained
quickly. This feature allows users to exercise their judgment in deciding which data display
is most appropriate for a given purpose. Let us use the 189 ages shown in Table 1.4.1 and
arrayed in Table 2.2.1 to illustrate the construction of a frequency distribution.
EXAMPLE 2.3.1
We wish to know how many class intervals to have in the frequency distribution of the data.
We also want to know how wide the intervals should be.
Solution: To get an idea as to the number of class intervals to use, we can apply
Sturges’s rule to obtain
$$k = 1 + 3.322(\log_{10} 189) = 1 + 3.322(2.2764618) \approx 9$$
Now let us divide the range by 9 to get some idea about the class
interval width. We have
$$\frac{R}{k} = \frac{82 - 30}{9} = \frac{52}{9} = 5.778$$
It is apparent that a class interval width of 5 or 10 will be more
convenient to use, as well as more meaningful to the reader. Suppose we
decide on 10. We may now construct our intervals. Since the smallest value in
Table 2.2.1 is 30 and the largest value is 82, we may begin our intervals with
30 and end with 89. This gives the following intervals:
30–39
40–49
50–59
60–69
70–79
80–89
We see that there are six of these intervals, three fewer than the number
suggested by Sturges’s rule.
It is sometimes useful to refer to the center, called the midpoint, of a
class interval. The midpoint of a class interval is determined by obtaining the
sum of the upper and lower limits of the class interval and dividing by 2.
Thus, for example, the midpoint of the class interval 30–39 is found to be
$(30 + 39)/2 = 34.5$.
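The steps of this example can be sketched in Python as follows. The helper `class_intervals` is hypothetical, and its choice of starting the first interval at a multiple of the width is a convenience assumption, in the spirit of the rules of thumb given earlier; whole-number data are also assumed.

```python
def class_intervals(smallest, largest, width):
    """Return (lower, upper) class limits for whole-number data.

    Assumption: the first lower limit is the largest multiple of `width`
    that does not exceed the smallest observation.
    """
    lower = (smallest // width) * width
    intervals = []
    while lower <= largest:
        intervals.append((lower, lower + width - 1))
        lower += width
    return intervals

suggested_width = (82 - 30) / 9            # R / k, about 5.778
intervals = class_intervals(30, 82, 10)    # the more convenient width of 10
midpoints = [(lo + hi) / 2 for lo, hi in intervals]

print(round(suggested_width, 3))
print(intervals)
print(midpoints)
```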
When we group data manually, determining the number of values falling into each
class interval is merely a matter of looking at the ordered array and counting the number
of observations falling in the various intervals. When we do this for our example, we
have Table 2.3.1.
A table such as Table 2.3.1 is called a frequency distribution. This table shows the
way in which the values of the variable are distributed among the specified class intervals.
By consulting it, we can determine the frequency of occurrence of values within any one of
the class intervals shown.
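When the grouping is done by computer rather than by hand, the counting step might look like the Python sketch below. The ages of Table 2.2.1 are stored compactly as a mapping from each age to the number of subjects with that age; the function name and its parameters are assumptions made for this illustration.

```python
from collections import Counter

# Ages from Table 2.2.1, stored as {age: number of subjects with that age}.
AGE_COUNTS = {
    30: 1, 34: 1, 35: 1, 37: 2, 38: 4, 39: 2,
    40: 2, 42: 2, 43: 6, 44: 7, 45: 3, 46: 6, 47: 6, 48: 7, 49: 7,
    50: 8, 51: 4, 52: 6, 53: 17, 54: 11, 55: 3, 56: 6, 57: 7, 58: 2, 59: 6,
    60: 4, 61: 11, 62: 7, 63: 2, 64: 6, 65: 2, 66: 6, 67: 1, 68: 3, 69: 3,
    70: 1, 71: 7, 72: 1, 73: 1, 75: 1, 76: 1, 77: 1, 78: 3,
    82: 1,
}
ages = sorted(Counter(AGE_COUNTS).elements())   # the 189 individual ages

def frequency_distribution(values, start, width):
    """Count whole-number values falling in each interval of equal width."""
    counts = Counter((v - start) // width for v in values)
    return {
        (start + i * width, start + (i + 1) * width - 1): counts[i]
        for i in range(max(counts) + 1)
    }

table = frequency_distribution(ages, start=30, width=10)
for (lower, upper), frequency in table.items():
    print(f"{lower}-{upper}  {frequency}")
```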
Relative Frequencies It may be useful at times to know the proportion, rather
than the number, of values falling within a particular class interval. We obtain this
information by dividing the number of values in the particular class interval by the total
number of values. If, in our example, we wish to know the proportion of values between 50
and 59, inclusive, we divide 70 by 189, obtaining .3704. Thus we say that 70 out of 189, or
70/189ths, or .3704, of the values are between 50 and 59. Multiplying .3704 by 100 gives us
the percentage of values between 50 and 59. We can say, then, that 37.04 percent of the
subjects are between 50 and 59 years of age. We may refer to the proportion of values
falling within a class interval as the relative frequency of occurrence of values in that
interval. In Section 3.2 we shall see that a relative frequency may be interpreted also as the
probability of occurrence within the given interval. This probability of occurrence is also
called the experimental probability or the empirical probability.
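A brief Python sketch of the relative frequency calculation is given here. It starts from the class frequencies of the age data; the variable names are illustrative only.

```python
# Class frequencies for the ages of the 189 subjects.
frequencies = {
    "30-39": 11, "40-49": 46, "50-59": 70,
    "60-69": 45, "70-79": 16, "80-89": 1,
}
total = sum(frequencies.values())

# Proportion of all values falling in each class interval.
relative = {interval: f / total for interval, f in frequencies.items()}

# Multiplying a proportion by 100 expresses it as a percentage.
percent = {interval: 100 * r for interval, r in relative.items()}

print(round(relative["50-59"], 4), round(percent["50-59"], 2))
```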
TABLE 2.3.1 Frequency Distribution of
Ages of 189 Subjects Shown in Tables 1.4.1
and 2.2.1
Class Interval Frequency
30–39 11
40–49 46
50–59 70
60–69 45
70–79 16
80–89 1
Total 189
In determining the frequency of values falling within two or more class intervals, we
obtain the sum of the number of values falling within the class intervals of interest.
Similarly, if we want to know the relative frequency of occurrence of values falling within
two or more class intervals, we add the respective relative frequencies. We may sum, or
cumulate, the frequencies and relative frequencies to facilitate obtaining information
regarding the frequency or relative frequency of values within two or more contiguous
class intervals. Table 2.3.2 shows the data of Table 2.3.1 along with the cumulative
frequencies, the relative frequencies, and cumulative relative frequencies.
Suppose that we are interested in the relative frequency of values between 50 and 79.
We use the cumulative relative frequency column of Table 2.3.2 and subtract .3016 from
.9948, obtaining .6932.
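The cumulation of frequencies can be sketched in Python with `itertools.accumulate`. The helper `relative_frequency_between` is a hypothetical name; it works from the unrounded counts rather than from rounded table entries.

```python
from itertools import accumulate

labels = ["30-39", "40-49", "50-59", "60-69", "70-79", "80-89"]
frequencies = [11, 46, 70, 45, 16, 1]
total = sum(frequencies)

cumulative = list(accumulate(frequencies))            # running totals
cumulative_relative = [c / total for c in cumulative]

def relative_frequency_between(first, last):
    """Relative frequency for class positions first..last (0-based, inclusive)."""
    below = cumulative[first - 1] if first > 0 else 0
    return (cumulative[last] - below) / total

# Classes 50-59 through 70-79 are positions 2 through 4.
share_50_to_79 = relative_frequency_between(2, 4)
print(cumulative, round(share_50_to_79, 4))
```

Working from the counts gives 131/189, or about .6931. The value .6932 obtained in the text differs in the last digit because it subtracts table entries that were themselves rounded.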
We may use a statistical package to obtain a table similar to that shown in Table 2.3.2.
Tables obtained from both MINITAB and SPSS software are shown in Figure 2.3.1.
The Histogram We may display a frequency distribution (or a relative frequency
distribution) graphically in the form of a histogram, which is a special type of bar graph.
When we construct a histogram the values of the variable under consideration are
represented by the horizontal axis, while the vertical axis has as its scale the frequency (or
relative frequency if desired) of occurrence. Above each class interval on the horizontal
axis a rectangular bar, or cell, as it is sometimes called, is erected so that the height
corresponds to the respective frequency when the class intervals are of equal width. The
cells of a histogram must be joined and, to accomplish this, we must take into account the
true boundaries of the class intervals to prevent gaps from occurring between the cells of
our graph.
The level of precision observed in reported data that are measured on a continuous
scale indicates some order of rounding. The order of rounding reflects either the reporter’s
personal preference or the limitations of the measuring instrument employed. When a
frequency distribution is constructed from the data, the class interval limits usually reflect
the degree of precision of the raw data. This has been done in our illustrative example.
TABLE 2.3.2 Frequency, Cumulative Frequency, Relative Frequency, and Cumulative Relative Frequency Distributions of the Ages of Subjects Described in Example 1.4.1
Class Interval   Frequency   Cumulative Frequency   Relative Frequency   Cumulative Relative Frequency
30–39            11          11                     .0582                .0582
40–49            46          57                     .2434                .3016
50–59            70          127                    .3704                .6720
60–69            45          172                    .2381                .9101
70–79            16          188                    .0847                .9948
80–89            1           189                    .0053                1.0001
Total            189                                1.0001
Note: Relative frequencies do not add to 1.0000 exactly because of rounding.
We know, however, that some of the values falling in the second class interval, for example,
when measured precisely, would probably be a little less than 40 and some would be a little
greater than 49. Considering the underlying continuity of our variable, and assuming that
the data were rounded to the nearest whole number, we find it convenient to think of 39.5
and 49.5 as the true limits of this second interval. The true limits for each of the class
intervals, then, we take to be as shown in Table 2.3.3.
If we construct a graph using these class limits as the base of our rectangles, no gaps
will result, and we will have the histogram shown in Figure 2.3.2. We used MINITAB to
construct this histogram, as shown in Figure 2.3.3.
We refer to the space enclosed by the boundaries of the histogram as the area of the
histogram. Each observation is allotted one unit of this area. Since we have 189
observations, the histogram consists of a total of 189 units. Each cell contains a certain
proportion of the total area, depending on the frequency. The second cell, for example,
contains 46/189 of the area. This, as we have learned, is the relative frequency of
occurrence of values between 39.5 and 49.5. From this we see that subareas of the
histogram defined by the cells correspond to the frequencies of occurrence of values
between the horizontal scale boundaries of the areas. The ratio of a particular subarea to the
total area of the histogram is equal to the relative frequency of occurrence of values
between the corresponding points on the horizontal axis.
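The relationship between subareas and relative frequencies can be checked numerically. The Python sketch below assumes, as the text does, that the ages were rounded to the nearest whole number, so each true limit lies 0.5 beyond the stated class limit. Areas are measured here as width times frequency; any other unit of area gives the same ratios.

```python
class_limits = [(30, 39), (40, 49), (50, 59), (60, 69), (70, 79), (80, 89)]
frequencies = [11, 46, 70, 45, 16, 1]

# True limits under the rounding assumption stated above.
true_limits = [(lo - 0.5, hi + 0.5) for lo, hi in class_limits]
widths = [hi - lo for lo, hi in true_limits]

# With equal widths the bar height is the frequency, so area = width * height.
areas = [w * f for w, f in zip(widths, frequencies)]
total_area = sum(areas)

# Ratio of each cell's area to the total area of the histogram.
area_ratios = [a / total_area for a in areas]
print(true_limits[1], round(area_ratios[1], 4))
```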
Dialog box:
Stat > Tables > Tally Individual Variables
Type C2 in Variables. Check Counts, Percents, Cumulative counts, and Cumulative percents in Display. Click OK.
Session command:
MTB > Tally C2;
SUBC> Counts;
SUBC> CumCounts;
SUBC> Percents;
SUBC> CumPercents;
MINITAB Output:
Tally for Discrete Variables: C2
C2   Count   CumCnt   Percent   CumPct
0    11      11       5.82      5.82
1    46      57       24.34     30.16
2    70      127      37.04     67.20
3    45      172      23.81     91.01
4    16      188      8.47      99.47
5    1       189      0.53      100.00
N= 189
SPSS Output:
               Frequency   Percent   Valid Percent   Cumulative Percent
Valid 30-39    11          5.8       5.8             5.8
      40-49    46          24.3      24.3            30.2
      50-59    70          37.0      37.0            67.2
      60-69    45          23.8      23.8            91.0
      70-79    16          8.5       8.5             99.5
      80-89    1           .5        .5              100.0
      Total    189         100.0     100.0
FIGURE 2.3.1 Frequency, cumulative frequencies, percent, and cumulative percent
distribution of the ages of subjects described in Example 1.4.1 as constructed by MINITAB and
SPSS.
The Frequency Polygon A frequency distribution can be portrayed graphically
in yet another way by means of a frequency polygon, which is a special kind of line graph.
To draw a frequency polygon we first place a dot above the midpoint of each class interval
represented on the horizontal axis of a graph like the one shown in Figure 2.3.2. The height
of a given dot above the horizontal axis corresponds to the frequency of the relevant class
interval. Connecting the dots by straight lines produces the frequency polygon. Figure 2.3.4
is the frequency polygon for the age data in Table 2.2.1.
Note that the polygon is brought down to the horizontal axis at the ends at points that
would be the midpoints if there were an additional cell at each end of the corresponding
histogram. This allows for the total area to be enclosed. The total area under the frequency
polygon is equal to the area under the histogram. Figure 2.3.5 shows the frequency polygon
of Figure 2.3.4 superimposed on the histogram of Figure 2.3.2. This figure allows you to
see, for the same set of data, the relationship between the two graphic forms.
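The claim that the polygon and the histogram enclose the same area can be verified with a short Python sketch. The vertices are the class midpoints and frequencies of the age data, closed at each end by the midpoint of an imaginary empty cell; the function name is an assumption made for this illustration.

```python
midpoints = [34.5, 44.5, 54.5, 64.5, 74.5, 84.5]
frequencies = [11, 46, 70, 45, 16, 1]
width = 10

# Bring the polygon down to the axis at the midpoints of two extra cells.
xs = [midpoints[0] - width] + midpoints + [midpoints[-1] + width]
ys = [0] + frequencies + [0]

def area_under_polygon(xs, ys):
    """Sum the trapezoids between consecutive vertices."""
    area = 0.0
    for i in range(len(xs) - 1):
        area += (xs[i + 1] - xs[i]) * (ys[i] + ys[i + 1]) / 2
    return area

polygon_area = area_under_polygon(xs, ys)
histogram_area = width * sum(frequencies)   # equal-width bars
print(polygon_area, histogram_area)
```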
TABLE 2.3.3 The Data of
Table 2.3.1 Showing True Class
Limits
True Class Limits Frequency
29.5–39.5 11
39.5–49.5 46
49.5–59.5 70
59.5–69.5 45
69.5–79.5 16
79.5–89.5 1
Total 189
[Histogram. Horizontal axis: Age, with class midpoints 34.5, 44.5, 54.5, 64.5, 74.5, and 84.5. Vertical axis: Frequency, scaled from 0 to 70.]
FIGURE 2.3.2 Histogram of ages of 189 subjects from Table 2.3.1.
Dialog box:
Graph > Histogram > Simple > OK
Type Age in Graph Variables: Click OK.
Now double click the histogram and click Binning Tab.
Type 34.5:84.5/10 in MidPoint/CutPoint positions:
Click OK.
Session command:
MTB > Histogram 'Age';
SUBC> MidPoint 34.5:84.5/10;
SUBC> Bar.
FIGURE 2.3.3 MINITAB dialog box and session command for constructing histogram from
data on ages in Example 1.4.1.
Stem-and-Leaf Displays Another graphical device that is useful for representing
quantitative data sets is the stem-and-leaf display. A stem-and-leaf display bears a
strong resemblance to a histogram and serves the same purpose. A properly constructed
stem-and-leaf display, like a histogram, provides information regarding the range of the
data set, shows the location of the highest concentration of measurements, and reveals the
presence or absence of symmetry. An advantage of the stem-and-leaf display over the
histogram is the fact that it preserves the information contained in the individual
measurements. Such information is lost when measurements are assigned to the class
intervals of a histogram. As will become apparent, another advantage of stem-and-leaf
displays is the fact that they can be constructed during the tallying process, so the
intermediate step of preparing an ordered array is eliminated.
To construct a stem-and-leaf display we partition each measurement into two parts.
The first part is called the stem, and the second part is called the leaf. The stem consists of
one or more of the initial digits of the measurement, and the leaf is composed of one or
more of the remaining digits. All partitioned numbers are shown together in a single
display; the stems form an ordered column with the smallest stem at the top and the largest
at the bottom. We include in the stem column all stems within the range of the data even
when a measurement with that stem is not in the data set. The rows of the display contain
the leaves, ordered and listed to the right of their respective stems. When leaves consist of
more than one digit, all digits after the first may be deleted. Decimals when present in the
original data are omitted in the stem-and-leaf display. The stems are separated from their
leaves by a vertical line. Thus we see that a stem-and-leaf display is also an ordered array of
the data.
Stem-and-leaf displays are most effective with relatively small data sets. As a rule
they are not suitable for use in annual reports or other communications aimed at the general
public. They are primarily of value in helping researchers and decision makers understand
the nature of their data. Histograms are more appropriate for externally circulated
publications. The following example illustrates the construction of a stem-and-leaf display.
[Frequency polygon. Horizontal axis: Age, marked at 24.5, 34.5, 44.5, 54.5, 64.5, 74.5, 84.5, and 94.5. Vertical axis: Frequency, scaled from 0 to 70.]
FIGURE 2.3.4 Frequency polygon for the ages of 189 subjects shown in Table 2.2.1.
[Histogram with superimposed frequency polygon. Horizontal axis: Age, marked at 24.5, 34.5, 44.5, 54.5, 64.5, 74.5, 84.5, and 94.5. Vertical axis: Frequency, scaled from 0 to 70.]
FIGURE 2.3.5 Histogram and frequency polygon for the ages of 189 subjects shown in Table 2.2.1.
EXAMPLE 2.3.2
Let us use the age data shown in Table 2.2.1 to construct a stem-and-leaf display.
Solution: Since the measurements are all two-digit numbers, we will have one-digit
stems and one-digit leaves. For example, the measurement 30 has a stem of 3
and a leaf of 0. Figure 2.3.6 shows the stem-and-leaf display for the data.
The MINITAB statistical software package may be used to construct
stem-and-leaf displays. The MINITAB procedure and output are as shown in
Figure 2.3.7. The increment subcommand specifies the distance from one
stem to the next. The numbers in the leftmost output column of Figure 2.3.7
Stem Leaf
3 04577888899
4 0022333333444444455566666677777788888889999999
5 0000000011112222223333333333333333344444444444555666666777777788999999
6 000011111111111222222233444444556666667888999
7 0111111123567888
8 2
FIGURE 2.3.6 Stem-and-leaf display of ages of 189 subjects shown in Table 2.2.1 (stem
unit = 10, leaf unit = 1).
Dialog box:
Graph > Stem-and-Leaf
Type Age in Graph Variables. Type 10 in Increment.
Click OK.
Session command:
MTB > Stem-and-Leaf 'Age';
SUBC> Increment 10.
Output:
Stem-and-Leaf Display: Age
Stem-and-leaf of Age N = 189
Leaf Unit = 1.0
11 3 04577888899
57 4 0022333333444444455566666677777788888889999999
(70) 5 00000000111122222233333333333333333444444444445556666667777777889+
62 6 000011111111111222222233444444556666667888999
17 7 0111111123567888
1 8 2
FIGURE 2.3.7 Stem-and-leaf display prepared by MINITAB from the data on subjects’ ages
shown in Table 2.2.1.
provide information regarding the number of observations (leaves) on a given
line and above or the number of observations on a given line and below. For
example, the number 57 on the second line shows that there are 57
observations (or leaves) on that line and the one above it. The number 62
on the fourth line from the top tells us that there are 62 observations on that
line and all the ones below. The number in parentheses tells us that there are
70 observations on that line. The parentheses mark the line containing the
middle observation if the total number of observations is odd or the two
middle observations if the total number of observations is even.
The + at the end of the third line in Figure 2.3.7 indicates that the
frequency for that line (age group 50 through 59) exceeds the line capacity,
and that there is at least one additional leaf that is not shown. In this case, the
frequency for the 50–59 age group was 70. The line contains only 65 leaves,
so the + indicates that there are five more leaves, the number 9, that are not
shown.
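The leftmost column described above can be imitated with a small Python sketch. It follows the description in the text (counts cumulated from the top down to the middle, from the bottom up after it, and the middle line shown in parentheses); it is an assumption that this matches MINITAB's internal rule in every edge case, and the function name is hypothetical.

```python
def depth_column(line_counts):
    """Return the leftmost-column entries for a stem-and-leaf display."""
    n = sum(line_counts)
    lo_mid, hi_mid = (n + 1) // 2, n // 2 + 1   # middle position(s), 1-based
    depths, seen = [], 0
    for count in line_counts:
        first, last = seen + 1, seen + count    # positions held by this line
        if first <= lo_mid and hi_mid <= last:
            depths.append(f"({count})")         # line holding the middle value(s)
        elif last <= lo_mid:
            depths.append(str(last))            # this line and those above it
        else:
            depths.append(str(n - seen))        # this line and those below it
        seen = last
    return depths

# Leaves per line when the distance between stems is 10.
print(depth_column([11, 46, 70, 45, 16, 1]))
```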
One way to avoid exceeding the capacity of a line is to have more lines. This is
accomplished by making the distance between lines shorter, that is, by decreasing the
widths of the class intervals. For the present example, we may use class interval widths of 5,
so that the distance between lines is 5. Figure 2.3.8 shows the result when MINITAB is used
to produce the stem-and-leaf display.
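A stem-and-leaf display with an adjustable distance between lines can be sketched in Python as follows. The function `stem_and_leaf` and its `increment` parameter are assumptions made for this illustration (the parameter is only loosely modeled on the MINITAB subcommand), and the sketch handles two-digit whole numbers with an increment of 10 or 5 only.

```python
from collections import Counter

# Ages from Table 2.2.1, stored as {age: number of subjects with that age}.
AGE_COUNTS = {
    30: 1, 34: 1, 35: 1, 37: 2, 38: 4, 39: 2,
    40: 2, 42: 2, 43: 6, 44: 7, 45: 3, 46: 6, 47: 6, 48: 7, 49: 7,
    50: 8, 51: 4, 52: 6, 53: 17, 54: 11, 55: 3, 56: 6, 57: 7, 58: 2, 59: 6,
    60: 4, 61: 11, 62: 7, 63: 2, 64: 6, 65: 2, 66: 6, 67: 1, 68: 3, 69: 3,
    70: 1, 71: 7, 72: 1, 73: 1, 75: 1, 76: 1, 77: 1, 78: 3,
    82: 1,
}
ages = sorted(Counter(AGE_COUNTS).elements())

def stem_and_leaf(values, increment=10):
    """Return (stem, leaves) rows; every line in the data range is kept."""
    first_line = min(values) // increment * increment
    rows = {start: "" for start in range(first_line, max(values) + 1, increment)}
    for v in sorted(values):
        rows[v // increment * increment] += str(v % 10)   # append the leaf digit
    return [(start // 10, leaves) for start, leaves in rows.items()]

for stem, leaves in stem_and_leaf(ages, increment=5):
    print(stem, leaves)
```

With `increment=10` the same function gives one line per stem, so the two displays can be compared directly.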
Stem-and-leaf of Age N = 189
Leaf Unit = 1.0
2 3 04
11 3 577888899
28 4 00223333334444444
57 4 55566666677777788888889999999
(46) 5 0000000011112222223333333333333333344444444444
86 5 555666666777777788999999
62 6 000011111111111222222233444444
32 6 556666667888999
17 7 0111111123
7 7 567888
1 8 2
FIGURE 2.3.8 Stem-and-leaf display prepared by MINITAB from the data on subjects’ ages
shown in Table 2.2.1; class interval width = 5.
EXERCISES
2.3.1 In a study of the oral home care practice and reasons for seeking dental care among individuals on
renal dialysis, Atassi (A-1) studied 90 subjects on renal dialysis. The oral hygiene status of all
subjects was examined using a plaque index with a range of 0 to 3 (0 = no soft plaque deposits,
3 = an abundance of soft plaque deposits). The following table shows the plaque index scores for all
90 subjects.
1.17 2.50 2.00 2.33 1.67 1.33
1.17 2.17 2.17 1.33 2.17 2.00
2.17 1.17 2.50 2.00 1.50 1.50
1.00 2.17 2.17 1.67 2.00 2.00
1.33 2.17 2.83 1.50 2.50 2.33
0.33 2.17 1.83 2.00 2.17 2.00
1.00 2.17 2.17 1.33 2.17 2.50
0.83 1.17 2.17 2.50 2.00 2.50
0.50 1.50 2.00 2.00 2.00 2.00
1.17 1.33 1.67 2.17 1.50 2.00
1.67 0.33 1.50 2.17 2.33 2.33
1.17 0.00 1.50 2.33 1.83 2.67
0.83 1.17 1.50 2.17 2.67 1.50
2.00 2.17 1.33 2.00 2.33 2.00
2.17 2.17 2.00 2.17 2.00 2.17
Source: Data provided courtesy of Farhad
Atassi, DDS, MSc, FICOI.
(a) Use these data to prepare:
A frequency distribution
A relative frequency distribution
A cumulative frequency distribution
A cumulative relative frequency distribution
A histogram
A frequency polygon
(b) What percentage of the measurements are less than 2.00?
(c) What proportion of the subjects have measurements greater than or equal to 1.50?
(d) What percentage of the measurements are between 1.50 and 1.99 inclusive?
(e) How many of the measurements are greater than 2.49?
(f) What proportion of the measurements are either less than 1.0 or greater than 2.49?
(g) Someone picks a measurement at random from this data set and asks you to guess the value.
What would be your answer? Why?
(h) Frequency distributions and their histograms may be described in a number of ways depending
on their shape. For example, they may be symmetric (the left half is at least approximately a mirror
image of the right half), skewed to the left (the frequencies tend to increase as the measurements
increase in size), skewed to the right (the frequencies tend to decrease as the measurements increase
in size), or U-shaped (the frequencies are high at each end of the distribution and small in the center).
How would you describe the present distribution?
2.3.2 Janardhan et al. (A-2) conducted a study in which they measured incidental intracranial aneurysms
(IIAs) in 125 patients. The researchers examined postprocedural complications and concluded that
IIAs can be safely treated without causing mortality and with a lower complications rate than
previously reported. The following are the sizes (in millimeters) of the 159 IIAs in the sample.
8.1 10.0 5.0 7.0 10.0 3.0
20.0 4.0 4.0 6.0 6.0 7.0
10.0 4.0 3.0 5.0 6.0 6.0
6.0 6.0 6.0 5.0 4.0 5.0
6.0 25.0 10.0 14.0 6.0 6.0
4.0 15.0 5.0 5.0 8.0 19.0
21.0 8.3 7.0 8.0 5.0 8.0
5.0 7.5 7.0 10.0 15.0 8.0
10.0 3.0 15.0 6.0 10.0 8.0
7.0 5.0 10.0 3.0 7.0 3.3
15.0 5.0 5.0 3.0 7.0 8.0
3.0 6.0 6.0 10.0 15.0 6.0
3.0 3.0 7.0 5.0 4.0 9.2
16.0 7.0 8.0 5.0 10.0 10.0
9.0 5.0 5.0 4.0 8.0 4.0
3.0 4.0 5.0 8.0 30.0 14.0
15.0 2.0 8.0 7.0 12.0 4.0
3.8 10.0 25.0 8.0 9.0 14.0
30.0 2.0 10.0 5.0 5.0 10.0
22.0 5.0 5.0 3.0 4.0 8.0
7.5 5.0 8.0 3.0 5.0 7.0
8.0 5.0 9.0 11.0 2.0 10.0
6.0 5.0 5.0 12.0 9.0 8.0
15.0 18.0 10.0 9.0 5.0 6.0
6.0 8.0 12.0 10.0 5.0
5.0 16.0 8.0 5.0 8.0
4.0 16.0 3.0 7.0 13.0
Source: Data provided courtesy of
Vallabh Janardhan, M.D.
(a) Use these data to prepare:
A frequency distribution
A relative frequency distribution
A cumulative frequency distribution
A cumulative relative frequency distribution
A histogram
A frequency polygon
(b) What percentage of the measurements are between 10 and 14.9 inclusive?
(c) How many observations are less than 20?
(d) What proportion of the measurements are greater than or equal to 25?
(e) What percentage of the measurements are either less than 10.0 or greater than 19.95?
(f) Refer to Exercise 2.3.1, part h. Describe the distribution of the size of the aneurysms in this sample.
2.3.3 Hoekema et al. (A-3) studied the craniofacial morphology of patients diagnosed with obstructive
sleep apnea syndrome (OSAS) and of healthy male subjects. One of the demographic variables the
researchers collected for all subjects was the Body Mass Index (calculated by dividing weight in kg
by the square of the patient’s height in m). The following are the BMI values of 29 OSAS subjects.
33.57 27.78 40.81
38.34 29.01 47.78
26.86 54.33 28.99
25.21 30.49 27.38
36.42 41.50 29.39
24.54 41.75 44.68
24.49 33.23 47.09
29.07 28.21 42.10
26.54 27.74 33.48
31.44 30.08
Source: Data provided courtesy
of A. Hoekema, D.D.S.
(a) Use these data to construct:
A frequency distribution
A relative frequency distribution
A cumulative frequency distribution
A cumulative relative frequency distribution
A histogram
A frequency polygon
(b) What percentage of the measurements are less than 30?
(c) What percentage of the measurements are between 40.0 and 49.99 inclusive?
(d) What percentage of the measurements are greater than 34.99?
(e) Describe these data with respect to symmetry and skewness as discussed in Exercise 2.3.1, part h.
(f) How many of the measurements are less than 40?
2.3.4 David Holben (A-4) studied selenium levels in beef raised in a low selenium region of the United
States. The goal of the study was to compare selenium levels in the region-raised beef to selenium
levels in cooked venison, squirrel, and beef from other regions of the United States. The data below
are the selenium levels calculated on a dry weight basis in μg/100 g for a sample of 53 region-raised
cattle.
11.23 15.82
29.63 27.74
20.42 22.35
10.12 34.78
39.91 35.09
32.66 32.60
38.38 37.03
36.21 27.00
16.39 44.20
27.44 13.09
17.29 33.03
56.20 9.69
28.94 32.45
20.11 37.38
25.35 34.91
21.77 27.99
31.62 22.36
32.63 22.68
30.31 26.52
46.16 46.01
56.61 38.04
24.47 30.88
29.39 30.04
40.71 25.91
18.52 18.54
27.80 25.51
19.49
Source: Data provided courtesy
of David Holben, Ph.D.
(a) Use these data to construct:
A frequency distribution
A relative frequency distribution
A cumulative frequency distribution
A cumulative relative frequency distribution
A histogram
A frequency polygon
(b) Describe these data with respect to symmetry and skewness as discussed in Exercise 2.3.1, part h.
(c) How many of the measurements are greater than 40?
(d) What percentage of the measurements are less than 25?
2.3.5 The following table shows the number of hours 45 hospital patients slept following the administration
of a certain anesthetic.
7 10 12 4 8 7 3 8 5
12 11 3 8 1 1 13 10 4
4 5 5 8 7 7 3 2 3
8 13 1 7 17 3 4 5 5
3 1 17 10 4 7 7 11 8
(a) From these data construct:
A frequency distribution
A relative frequency distribution
A histogram
A frequency polygon
(b) Describe these data relative to symmetry and skewness as discussed in Exercise 2.3.1, part h.
2.3.6 The following are the number of babies born during a year in 60 community hospitals.
30 55 27 45 56 48 45 49 32 57 47 56
37 55 52 34 54 42 32 59 35 46 24 57
32 26 40 28 53 54 29 42 42 54 53 59
39 56 59 58 49 53 30 53 21 34 28 50
52 57 43 46 54 31 22 31 24 24 57 29
(a) From these data construct:
A frequency distribution
A relative frequency distribution
A histogram
A frequency polygon
(b) Describe these data relative to symmetry and skewness as discussed in Exercise 2.3.1, part h.
2.3.7 In a study of physical endurance levels of male college freshmen, the following composite endurance
scores based on several exercise routines were collected.
254 281 192 260 212 179 225 179 181 149
182 210 235 239 258 166 159 223 186 190
180 188 135 233 220 204 219 211 245 151
198 190 151 157 204 238 205 229 191 200
222 187 134 193 264 312 214 227 190 212
165 194 206 193 218 198 241 149 164 225
265 222 264 249 175 205 252 210 178 159
220 201 203 172 234 198 173 187 189 237
272 195 227 230 168 232 217 249 196 223
232 191 175 236 152 258 155 215 197 210
214 278 252 283 205 184 172 228 193 130
218 213 172 159 203 212 117 197 206 198
169 187 204 180 261 236 217 205 212 218
191 124 199 235 139 231 116 182 243 217
251 206 173 236 215 228 183 204 186 134
188 195 240 163 208
(a) From these data construct:
A frequency distribution
A relative frequency distribution
A frequency polygon
A histogram
(b) Describe these data relative to symmetry and skewness as discussed in Exercise 2.3.1, part h.
2.3.8 The following are the ages of 30 patients seen in the emergency room of a hospital on a Friday night.
Construct a stem-and-leaf display from these data. Describe these data relative to symmetry and
skewness as discussed in Exercise 2.3.1, part h.
35 32 21 43 39 60
36 12 54 45 37 53
45 23 64 10 34 22
36 45 55 44 55 46
22 38 35 56 45 57
2.3.9 The following are the emergency room charges made to a sample of 25 patients at two city hospitals.
Construct a stem-and-leaf display for each set of data. What does a comparison of the two displays
suggest regarding the two hospitals? Describe the two sets of data with respect to symmetry and
skewness as discussed in Exercise 2.3.1, part h.
Hospital A
249.10 202.50 222.20 214.40 205.90
214.30 195.10 213.30 225.50 191.40
201.20 239.80 245.70 213.00 238.80
171.10 222.00 212.50 201.70 184.90
248.30 209.70 233.90 229.80 217.90
Hospital B
199.50 184.00 173.20 186.00 214.10
125.50 143.50 190.40 152.00 165.70
154.70 145.30 154.60 190.30 135.40
167.70 203.40 186.70 155.30 195.90
168.90 166.70 178.60 150.20 212.40
2.3.10 Refer to the ages of patients discussed in Example 1.4.1 and displayed in Table 1.4.1.
(a) Use class interval widths of 5 and construct:
A frequency distribution
A relative frequency distribution
A cumulative frequency distribution
A cumulative relative frequency distribution
A histogram
A frequency polygon
(b) Describe these data with respect to symmetry and skewness as discussed in Exercise 2.3.1, part h.
2.3.11 The objectives of a study by Skjelbo et al. (A-5) were to examine (a) the relationship between
chloroguanide metabolism and efficacy in malaria prophylaxis and (b) the mephenytoin metabolism
and its relationship to chloroguanide metabolism among Tanzanians. From information provided
by urine specimens from the 216 subjects, the investigators computed the ratio of unchanged
S-mephenytoin to R-mephenytoin (S/R ratio). The results were as follows:
0.0269 0.0400 0.0550 0.0550 0.0650 0.0670 0.0700 0.0720
0.0760 0.0850 0.0870 0.0870 0.0880 0.0900 0.0900 0.0990
0.0990 0.0990 0.0990 0.0990 0.0990 0.0990 0.0990 0.0990
0.0990 0.0990 0.0990 0.0990 0.0990 0.0990 0.0990 0.0990
0.0990 0.0990 0.0990 0.0990 0.0990 0.0990 0.0990 0.0990
0.0990 0.0990 0.0990 0.0990 0.0990 0.1000 0.1020 0.1040
0.1050 0.1050 0.1080 0.1080 0.1090 0.1090 0.1090 0.1160
0.1190 0.1200 0.1230 0.1240 0.1340 0.1340 0.1370 0.1390
0.1460 0.1480 0.1490 0.1490 0.1500 0.1500 0.1500 0.1540
0.1550 0.1570 0.1600 0.1650 0.1650 0.1670 0.1670 0.1677
0.1690 0.1710 0.1720 0.1740 0.1780 0.1780 0.1790 0.1790
0.1810 0.1880 0.1890 0.1890 0.1920 0.1950 0.1970 0.2010
0.2070 0.2100 0.2100 0.2140 0.2150 0.2160 0.2260 0.2290
0.2390 0.2400 0.2420 0.2430 0.2450 0.2450 0.2460 0.2460
0.2470 0.2540 0.2570 0.2600 0.2620 0.2650 0.2650 0.2680
0.2710 0.2800 0.2800 0.2870 0.2880 0.2940 0.2970 0.2980
0.2990 0.3000 0.3070 0.3100 0.3110 0.3140 0.3190 0.3210
0.3400 0.3440 0.3480 0.3490 0.3520 0.3530 0.3570 0.3630
0.3630 0.3660 0.3830 0.3900 0.3960 0.3990 0.4080 0.4080
0.4090 0.4090 0.4100 0.4160 0.4210 0.4260 0.4290 0.4290
0.4300 0.4360 0.4370 0.4390 0.4410 0.4410 0.4430 0.4540
0.4680 0.4810 0.4870 0.4910 0.4980 0.5030 0.5060 0.5220
0.5340 0.5340 0.5460 0.5480 0.5480 0.5490 0.5550 0.5920
0.5930 0.6010 0.6240 0.6280 0.6380 0.6600 0.6720 0.6820
0.6870 0.6900 0.6910 0.6940 0.7040 0.7120 0.7200 0.7280
0.7860 0.7950 0.8040 0.8200 0.8350 0.8770 0.9090 0.9520
0.9530 0.9830 0.9890 1.0120 1.0260 1.0320 1.0620 1.1600
Source: Data provided courtesy of Erik Skjelbo, M.D.
(a) From these data construct the following distributions: frequency, relative frequency, cumulative
frequency, and cumulative relative frequency; and the following graphs: histogram, frequency
polygon, and stem-and-leaf plot.
(b) Describe these data with respect to symmetry and skewness as discussed in Exercise 2.3.1, part h.
(c) The investigators defined as poor metabolizers of mephenytoin any subject with an S/R
mephenytoin ratio greater than .9. How many and what percentage of the subjects were poor
metabolizers?
(d) How many and what percentage of the subjects had ratios less than .7? Between .3 and .6999
inclusive? Greater than .4999?
2.3.12 Schmidt et al. (A-6) conducted a study to investigate whether autotransfusion of shed mediastinal
blood could reduce the number of patients needing homologous blood transfusion and reduce the
amount of transfused homologous blood if fixed transfusion criteria were used. The following table
shows the heights in meters of the 109 subjects of whom 97 were males.
1.720 1.710 1.700 1.655 1.800 1.700
1.730 1.700 1.820 1.810 1.720 1.800
1.800 1.800 1.790 1.820 1.800 1.650
1.680 1.730 1.820 1.720 1.710 1.850
1.760 1.780 1.760 1.820 1.840 1.690
1.770 1.920 1.690 1.690 1.780 1.720
1.750 1.710 1.690 1.520 1.805 1.780
1.820 1.790 1.760 1.830 1.760 1.800
1.700 1.760 1.750 1.630 1.760 1.770
1.840 1.690 1.640 1.760 1.850 1.820
1.760 1.700 1.720 1.780 1.630 1.650
1.660 1.880 1.740 1.900 1.830
1.600 1.800 1.670 1.780 1.800
1.750 1.610 1.840 1.740 1.750
1.960 1.760 1.730 1.730 1.810
1.810 1.775 1.710 1.730 1.740
1.790 1.880 1.730 1.560 1.820
1.780 1.630 1.640 1.600 1.800
1.800 1.780 1.840 1.830
1.770 1.690 1.800 1.620
Source: Data provided courtesy of Erik Skjelbo, M.D.
(a) For these data construct the following distributions: frequency, relative frequency, cumulative
frequency, and cumulative relative frequency; and the following graphs: histogram, frequency
polygon, and stem-and-leaf plot.
(b) Describe these data with respect to symmetry and skewness as discussed in Exercise 2.3.1, part h.
(c) How do you account for the shape of the distribution of these data?
(d) How tall were the tallest 6.42 percent of the subjects?
(e) How tall were the shortest 10.09 percent of the subjects?
2.4 DESCRIPTIVE STATISTICS:
MEASURES OF CENTRAL TENDENCY
Although frequency distributions serve useful purposes, there are many situations that
require other types of data summarization. What we need in many instances is the ability to
summarize the data by means of a single number called a descriptive measure. Descriptive
measures may be computed from the data of a sample or the data of a population. To
distinguish between them we have the following definitions:
DEFINITIONS
1. A descriptive measure computed from the data of a sample is called a
statistic.
2. A descriptive measure computed from the data of a population is
called a parameter.
Several types of descriptive measures can be computed from a set of data. In this
chapter, however, we limit discussion to measures of central tendency and measures of
dispersion. We consider measures of central tendency in this section and measures of
dispersion in the following one.
In each of the measures of central tendency, of which we discuss three, we have a
single value that is considered to be typical of the set of data as a whole. Measures of central
tendency convey information regarding the average value of a set of values. As we will see,
the word average can be defined in different ways.
The three most commonly used measures of central tendency are the mean, the
median, and the mode.
Arithmetic Mean The most familiar measure of central tendency is the arithmetic
mean. It is the descriptive measure most people have in mind when they speak of the
“average.” The adjective arithmetic distinguishes this mean from other means that can be
computed. Since we are not covering these other means in this book, we shall refer to the
arithmetic mean simply as the mean. The mean is obtained by adding all the values in a
population or sample and dividing by the number of values that are added.
EXAMPLE 2.4.1
We wish to obtain the mean age of the population of 189 subjects represented in Table 1.4.1.
Solution: We proceed as follows:
$$\text{mean age} = \frac{48 + 35 + 46 + \cdots + 73 + 66}{189} = 55.032$$
■
The three dots in the numerator represent the values we did not show in order to save
space.
General Formula for the Mean It will be convenient if we can generalize the
procedure for obtaining the mean and, also, represent the procedure in a more compact
notational form. Let us begin by designating the random variable of interest by the capital
letter X. In our present illustration we let X represent the random variable, age. Specific
values of a random variable will be designated by the lowercase letter x. To distinguish one
value from another, we attach a subscript to the x and let the subscript refer to the first, the
second, the third value, and so on. For example, from Table 1.4.1 we have
$$x_1 = 48,\; x_2 = 35,\; \ldots,\; x_{189} = 66$$
In general, a typical value of a random variable will be designated by $x_i$ and the final value,
in a finite population of values, by $x_N$, where $N$ is the number of values in the population.
Finally, we will use the Greek letter $\mu$ to stand for the population mean. We may now write
the general formula for a finite population mean as follows:
$$\mu = \frac{\sum_{i=1}^{N} x_i}{N} \tag{2.4.1}$$
The symbol $\sum_{i=1}^{N}$ instructs us to add all values of the variable from the first to the last. This
symbol $\Sigma$, called the summation sign, will be used extensively in this book. When from the
context it is obvious which values are to be added, the symbols above and below $\Sigma$ will be
omitted.
The Sample Mean When we compute the mean for a sample of values, the
procedure just outlined is followed with some modifications in notation. We use $\bar{x}$ to
designate the sample mean and $n$ to indicate the number of values in the sample. The
sample mean then is expressed as
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \tag{2.4.2}$$
EXAMPLE 2.4.2
In Chapter 1 we selected a simple random sample of 10 subjects from the population of
subjects represented in Table 1.4.1. Let us now compute the mean age of the 10 subjects in
our sample.
Solution: We recall (see Table 1.4.2) that the ages of the 10 subjects in our sample were
$x_1 = 43$, $x_2 = 66$, $x_3 = 61$, $x_4 = 64$, $x_5 = 65$, $x_6 = 38$, $x_7 = 59$, $x_8 = 57$,
$x_9 = 57$, $x_{10} = 50$. Substitution of our sample data into Equation 2.4.2 gives
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{43 + 66 + \cdots + 50}{10} = 56$$
■
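The computation in Example 2.4.2 can be checked with a few lines of Python. This is a minimal sketch; the names mean and sample_ages are chosen here for illustration and do not come from any statistical package.

```python
def mean(values):
    """Arithmetic mean: the sum of the values divided by how many there are."""
    return sum(values) / len(values)

# Ages of the 10 sampled subjects listed in Example 2.4.2 (Table 1.4.2).
sample_ages = [43, 66, 61, 64, 65, 38, 59, 57, 57, 50]
print(sum(sample_ages))    # 560
print(mean(sample_ages))   # 56.0
```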
Properties of the Mean The arithmetic mean possesses certain properties, some
desirable and some not so desirable. These properties include the following:
1. Uniqueness. For a given set of data there is one and only one arithmetic mean.
2. Simplicity. The arithmetic mean is easily understood and easy to compute.
3. Since each and every value in a set of data enters into the computation of the mean, it
is affected by each value. Extreme values, therefore, have an influence on the mean
and, in some cases, can so distort it that it becomes undesirable as a measure of
central tendency.
As an example of how extreme values may affect the mean, consider the following
situation. Suppose the five physicians who practice in an area are surveyed to determine
their charges for a certain procedure. Assume that they report these charges: $75, $75, $80,
$80, and $280. The mean charge for the five physicians is found to be $118, a value that is
not very representative of the set of data as a whole. The single atypical value had the effect
of inflating the mean.
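The influence of the single atypical charge can be seen by computing the mean with and without it. The short Python sketch below uses only the five charges quoted above; the variable names are illustrative.

```python
charges = [75, 75, 80, 80, 280]                 # reported charges in dollars

mean_all = sum(charges) / len(charges)

# Drop the atypical $280 charge to see how much it pulls the mean upward.
typical = [c for c in charges if c != 280]
mean_without_extreme = sum(typical) / len(typical)

print(mean_all)               # 118.0
print(mean_without_extreme)   # 77.5
```

One value out of five raises the mean from $77.50 to $118, which is larger than four of the five charges.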
Median The median of a finite set of values is that value which divides the set into
two equal parts such that the number of values equal to or greater than the median is
equal to the number of values equal to or less than the median. If the number of values is
odd, the median will be the middle value when all values have been arranged in order of
magnitude. When the number of values is even, there is no single middle value. Instead
there are two middle values. In this case the median is taken to be the mean of these two
middle values, when all values have been arranged in the order of their magnitudes. In
other words, the median observation of a data set is the $(n + 1)/2$th one when the
observations have been ordered. If, for example, we have 11 observations, the median is
the $(11 + 1)/2 = 6$th ordered observation. If we have 12 observations the median is the
$(12 + 1)/2 = 6.5$th ordered observation and is a value halfway between the 6th and 7th
ordered observations.
EXAMPLE 2.4.3
Let us illustrate by finding the median of the data in Table 2.2.1.
Solution: The values are already ordered so we need only to find the middle value.
The middle value is the $(n + 1)/2 = (189 + 1)/2 = 190/2 = 95$th one.
Counting from the smallest up to the 95th value we see that it is 54.
Thus the median age of the 189 subjects is 54 years. ■
EXAMPLE 2.4.4
We wish to find the median age of the subjects represented in the sample described in
Example 2.4.2.
Solution: Arraying the 10 ages in order of magnitude from smallest to largest gives 38,
43, 50, 57, 57, 59, 61, 64, 65, 66. Since we have an even number of ages, there
is no middle value. The two middle values, however, are 57 and 59. The
median, then, is $(57 + 59)/2 = 58$. ■
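The $(n + 1)/2$ rule can be written as a short Python function. This is a sketch for illustration; the name median is chosen here, and the data are the 10 sample ages of Example 2.4.2 and the five physician charges discussed under the properties of the mean.

```python
def median(values):
    """Median located at ordered position (n + 1) / 2, counting from 1."""
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) / 2
    lower = ordered[int(position) - 1]     # value at the whole-number part of the position
    if position.is_integer():
        return lower                       # odd n: a single middle value
    upper = ordered[int(position)]         # even n: the next ordered value
    return (lower + upper) / 2             # halfway between the two middle values

sample_ages = [43, 66, 61, 64, 65, 38, 59, 57, 57, 50]
print(median(sample_ages))                 # 58.0
print(median([75, 75, 80, 80, 280]))       # 80
```

For the physician charges the median is $80, a value that the single charge of $280 does not move.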
Properties of the Median Properties of the median include the following:
1. Uniqueness. As is true with the mean, there is only one median for a given set of
data.
2. Simplicity. The median is easy to calculate.
3. It is not as drastically affected by extreme values as is the mean.
The Mode The mode of a set of values is that value which occurs most frequently. If
all the values are different there is no mode; on the other hand, a set of values may have
more than one mode.
EXAMPLE 2.4.5
Find the modal age of the subjects whose ages are given in Table 2.2.1.
Solution: A count of the ages in Table 2.2.1 reveals that the age 53 occurs most
frequently (17 times). The mode for this population of ages is 53. ■
For an example of a set of values that has more than one mode, let us consider
a laboratory with 10 employees whose ages are 20, 21, 20, 20, 34, 22, 24, 27, 27,
and 27. We could say that these data have two modes, 20 and 27. The sample
consisting of the values 10, 21, 33, 53, and 54 has no mode since all the values are
different.
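Both cases can be reproduced by counting how often each value occurs. In the Python sketch below, the function name modes is hypothetical, and returning an empty list when every value occurs once is a convention adopted here to match the text.

```python
from collections import Counter

def modes(values):
    """Return every value that occurs most often; an empty list means no mode."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []                          # all values differ, so there is no mode
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([20, 21, 20, 20, 34, 22, 24, 27, 27, 27]))   # [20, 27]
print(modes([10, 21, 33, 53, 54]))                        # []
```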
The mode may be used also for describing qualitative data. For example, suppose the
patients seen in a mental health clinic during a given year received one of the following
diagnoses: mental retardation, organic brain syndrome, psychosis, neurosis, and personal-
ity disorder. The diagnosis occurring most frequently in the group of patients would be
called the modal diagnosis.
An attractive property of a data distribution occurs when the mean, median, and
mode are all equal. The well-known “bell-shaped curve” is a graphical representation of
a distribution for which the mean, median, and mode are all equal. Much statistical
inference is based on distributions of this type, the most common of which is the normal
distribution. The normal distribution is introduced in Section 4.6 and discussed further
in subsequent chapters. Another common distribution of this type is the t-distribution,
which is introduced in Section 6.3.
Skewness Data distributions may be classified on the basis of whether they are
symmetric or asymmetric. If a distribution is symmetric, the left half of its graph
(histogram or frequency polygon) will be a mirror image of its right half. When the
left half and right half of the graph of a distribution are not mirror images of each other, the
distribution is asymmetric.
DEFINITION
If the graph (histogram or frequency polygon) of a distribution is
asymmetric, the distribution is said to be skewed. If a distribution is
not symmetric because its graph extends further to the right than to
the left, that is, if it has a long tail to the right, we say that the distribution
is skewed to the right or is positively skewed. If a distribution is not
symmetric because its graph extends further to the left than to the right,
that is, if it has a long tail to the left, we say that the distribution is
skewed to the left or is negatively skewed.
A distribution will be skewed to the right, or positively skewed, if its mean is greater
than its mode. A distribution will be skewed to the left, or negatively skewed, if its mean is
less than its mode. Skewness can be expressed as follows:
$$\text{Skewness} = \frac{\sqrt{n}\,\sum_{i=1}^{n} (x_i - \bar{x})^3}{\left[\sum_{i=1}^{n} (x_i - \bar{x})^2\right]^{3/2}} = \frac{\sqrt{n}\,\sum_{i=1}^{n} (x_i - \bar{x})^3}{(n - 1)\sqrt{n - 1}\; s^3} \tag{2.4.3}$$
In Equation 2.4.3, $s$ is the standard deviation of a sample as defined in Equation 2.5.4. Most
computer statistical packages include this statistic as part of a standard printout. A value of
skewness > 0 indicates positive skewness and a value of skewness < 0 indicates negative
skewness. An illustration of skewness is shown in Figure 2.4.1.
EXAMPLE 2.4.6
Consider the three distributions shown in Figure 2.4.1. Given that the histograms represent
frequency counts, the data can be easily re-created and entered into a statistical package.
For example, observation of the “No Skew” distribution would yield the following data:
5, 5, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 8, 8, 9, 9, 9, 9, 10, 10, 10, 11, 11. Values can be obtained from
the skewed distributions in a similar fashion. Using SPSS software, the following
descriptive statistics were obtained for these three distributions:

            No Skew    Right Skew    Left Skew
Mean         8.0000        6.6667       8.3333
Median       8.0000        6.0000       9.0000
Mode           8.00          5.00        10.00
Skewness       .000          .627        −.627
■
FIGURE 2.4.1 Three histograms illustrating skewness.
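Equation 2.4.3 can be evaluated directly for the "No Skew" data listed in Example 2.4.6. The Python sketch below implements the first form of the equation; the function name skewness is chosen here for illustration. Statistical packages may apply a somewhat different small-sample adjustment, so a printed value from a package need not agree exactly with this formula. The physician charges from the discussion of the mean serve as a second, right-tailed data set.

```python
import math

def skewness(values):
    """Skewness as written in the first form of Equation 2.4.3."""
    n = len(values)
    xbar = sum(values) / n
    sum_cubed = sum((x - xbar) ** 3 for x in values)      # numerator sum
    sum_squared = sum((x - xbar) ** 2 for x in values)    # denominator sum
    return math.sqrt(n) * sum_cubed / sum_squared ** 1.5

# "No Skew" data re-created from Figure 2.4.1 in Example 2.4.6.
no_skew = [5, 5, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 8, 8,
           9, 9, 9, 9, 10, 10, 10, 11, 11]
print(skewness(no_skew))                       # 0.0
print(skewness([75, 75, 80, 80, 280]) > 0)     # True: long tail to the right
```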
2.5 DESCRIPTIVE STATISTICS:
MEASURES OF DISPERSION
The dispersion of a set of observations refers to the variety that they exhibit. A measure of
dispersion conveys information regarding the amount of variability present in a set of data.
If all the values are the same, there is no dispersion; if they are not all the same, dispersion is
present in the data. The amount of dispersion may be small when the values, though
different, are close together. Figure 2.5.1 shows the frequency polygons for two popula-
tions that have equal means but different amounts of variability. Population B, which is
more variable than population A, is more spread out. If the values are widely scattered, the
dispersion is greater. Other terms used synonymously with dispersion include variation,
spread, and scatter.
The Range One way to measure the variation in a set of values is to compute the
range. The range is the difference between the largest and smallest value in a set of
observations. If we denote the range by $R$, the largest value by $x_L$, and the smallest value
by $x_S$, we compute the range as follows:
$$R = x_L - x_S \tag{2.5.1}$$
[Figure: frequency polygons for Population A and Population B, both centered at μ]
FIGURE 2.5.1 Two frequency distributions with equal means but different amounts
of dispersion.
EXAMPLE 2.5.1
We wish to compute the range of the ages of the sample subjects discussed in Table 2.2.1.
Solution: Since the youngest subject in the sample is 30 years old and the oldest is 82,
we compute the range to be
$$R = 82 - 30 = 52$$
■
The usefulness of the range is limited. The fact that it takes into account only two values
causes it to be a poor measure of dispersion. The main advantage in using the range is the
simplicity of its computation. Since the range, expressed as a single measure, imparts
minimal information about a data set and therefore is of limited use, it is often preferable to
express the range as a number pair, $[x_S, x_L]$, in which $x_S$ and $x_L$ are the smallest and largest
values in the data set, respectively. For the data in Example 2.5.1, we may express the range
as the number pair [30, 82]. Although this is not the traditional expression for the range, it is
intuitive to imagine that knowledge of the minimum and maximum values in this data set
would convey more information than knowing only that the range is equal to 52. An infinite
number of distributions, each with quite different minimum and maximum values, may
have a range of 52.
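Both ways of reporting the range are easy to compute together. In the Python sketch below, the function name value_range is hypothetical. The first call uses only the youngest and oldest ages from Example 2.5.1, and the second uses the 10 sample ages from Example 2.4.2.

```python
def value_range(values):
    """Return the range R = x_L - x_S together with the pair (x_S, x_L)."""
    smallest, largest = min(values), max(values)
    return largest - smallest, (smallest, largest)

print(value_range([30, 82]))       # (52, (30, 82))

sample_ages = [43, 66, 61, 64, 65, 38, 59, 57, 57, 50]
print(value_range(sample_ages))    # (28, (38, 66))
```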
The Variance When the values of a set of observations lie close to their mean, the
dispersion is less than when they are scattered over a wide range. Since this is true, it would
be intuitively appealing if we could measure dispersion relative to the scatter of the values
about their mean. Such a measure is realized in what is known as the variance. In
computing the variance of a sample of values, for example, we subtract the mean from each
of the values, square the resulting differences, and then add up the squared differences. This
sum of the squared deviations of the values from their mean is divided by the sample size,
minus 1, to obtain the sample variance. Letting $s^2$ stand for the sample variance, the
procedure may be written in notational form as follows:
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \tag{2.5.2}$$
It is therefore easy to see that the variance can be described as the average squared
deviation of individual values from the mean of that set. It may seem nonintuitive at this
stage that the differences in the numerator be squared. However, consider a symmetric
distribution. It is easy to imagine that if we compute the difference of each data point in the
distribution from the mean value, half of the differences would be positive and half would
be negative, resulting in a sum that would be zero. A variance of zero would be a
noninformative measure for any distribution of numbers except one in which all of the
values are the same. Therefore, the square of each difference is used to ensure a positive
numerator and hence a much more valuable measure of dispersion.
EXAMPLE 2.5.2
Let us illustrate by computing the variance of the ages of the subjects discussed in
Example 2.4.2.
Solution:
$$s^2 = \frac{(43 - 56)^2 + (66 - 56)^2 + \cdots + (50 - 56)^2}{9} = \frac{810}{9} = 90$$
■
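The arithmetic of Example 2.5.2 can be verified step by step in Python. This is a minimal sketch of Equation 2.5.2; the function name sample_variance is chosen here for illustration.

```python
def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    xbar = sum(values) / n
    squared_deviations = [(x - xbar) ** 2 for x in values]
    return sum(squared_deviations) / (n - 1)

sample_ages = [43, 66, 61, 64, 65, 38, 59, 57, 57, 50]
print(sum((x - 56) ** 2 for x in sample_ages))   # 810
print(sample_variance(sample_ages))              # 90.0
```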
Degrees of Freedom The reason for dividing by $n - 1$ rather than $n$, as we might
have expected, is the theoretical consideration referred to as degrees of freedom. In
computing the variance, we say that we have $n - 1$ degrees of freedom. We reason as
follows. The sum of the deviations of the values from their mean is equal to zero, as can be
shown. If, then, we know the values of $n - 1$ of the deviations from the mean, we know the
$n$th one, since it is automatically determined because of the necessity for all $n$ values to add
to zero. From a practical point of view, dividing the squared differences by $n - 1$ rather than
$n$ is necessary in order to use the sample variance in the inference procedures discussed
later. The concept of degrees of freedom will be revisited in a later chapter. Students
interested in pursuing the matter further at this time should refer to the article by Walker (2).
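The reasoning about the deviations can be illustrated numerically with the 10 sample ages of Example 2.4.2. The Python sketch below is an illustration only, with variable names chosen here: once nine of the deviations are known, the tenth is forced to be the negative of their sum.

```python
sample_ages = [43, 66, 61, 64, 65, 38, 59, 57, 57, 50]
xbar = sum(sample_ages) / len(sample_ages)        # 56.0

deviations = [x - xbar for x in sample_ages]
print(sum(deviations))                            # 0.0

# Knowing the first n - 1 = 9 deviations determines the last one.
implied_last = -sum(deviations[:-1])
print(implied_last, deviations[-1])               # -6.0 -6.0
```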
When we compute the variance from a finite population of $N$ values, the procedures
outlined above are followed except that we subtract $\mu$ from each $x$ and divide by $N$ rather
than $N - 1$. If we let $\sigma^2$ stand for the finite population variance, the formula is as follows:
$$ \sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} \qquad (2.5.3) $$
Standard Deviation The variance represents squared units and, therefore, is not
an appropriate measure of dispersion when we wish to express this concept in terms of the
original units. To obtain a measure of dispersion in original units, we merely take the square
root of the variance. The result is called the standard deviation. In general, the standard
deviation of a sample is given by
$$ s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} \qquad (2.5.4) $$
The standard deviation of a finite population is obtained by taking the square root of the
quantity obtained by Equation 2.5.3, and is represented by $\sigma$.
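As a rough illustration of the difference between the sample and finite-population formulas, the sketch below uses Python's standard-library `statistics` module. The ages are assumed to be those of Example 2.4.2, and treating them as a complete finite population is done here purely for contrast, not because the text does so.

```python
import statistics

# Ages assumed to be the sample from Example 2.4.2 (not restated in this section).
ages = [43, 66, 61, 64, 65, 38, 59, 57, 57, 50]

# Sample formulas (Equations 2.5.2 and 2.5.4): divisor n - 1.
s2 = statistics.variance(ages)
s = statistics.stdev(ages)

# Finite-population formula (Equation 2.5.3): divisor N.
# Treating the ten ages as a whole population is an illustrative assumption.
sigma2 = statistics.pvariance(ages)
sigma = statistics.pstdev(ages)

print(s2, round(s, 4), sigma2, sigma)
```

The same sum of squared deviations (810) gives 90 when divided by 9 and 81 when divided by 10, so the choice of divisor matters most for small data sets.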
The Coefficient of Variation The standard deviation is useful as a measure of
variation within a given set of data. When one desires to compare the dispersion in two sets
of data, however, comparing the two standard deviations may lead to fallacious results. It
may be that the two variables involved are measured in different units. For example, we
may wish to know, for a certain population, whether serum cholesterol levels, measured in
milligrams per 100 ml, are more variable than body weight, measured in pounds.
Furthermore, although the same unit of measurement is used, the two means may be
quite different. If we compare the standard deviation of weights of first-grade children with
the standard deviation of weights of high school freshmen, we may find that the latter
standard deviation is numerically larger than the former, because the weights themselves
are larger, not because the dispersion is greater.
What is needed in situations like these is a measure of relative variation rather than
absolute variation. Such a measure is found in the coefficient of variation, which expresses
the standard deviation as a percentage of the mean. The formula is given by
$$ \text{C.V.} = \frac{s}{\bar{x}}(100)\% \qquad (2.5.5) $$
We see that, since the mean and standard deviation are expressed in the same unit of
measurement, the unit of measurement cancels out in computing the coefficient of
variation. What we have, then, is a measure that is independent of the unit of measurement.
EXAMPLE 2.5.3
Suppose two samples of human males yield the following results:
                      Sample 1      Sample 2
Age                   25 years      11 years
Mean weight           145 pounds    80 pounds
Standard deviation    10 pounds     10 pounds
We wish to know which is more variable, the weights of the 25-year-olds or the weights of
the 11-year-olds.
Solution: A comparison of the standard deviations might lead one to conclude that the
two samples possess equal variability. If we compute the coefficients of
variation, however, we have for the 25-year-olds
$$ \text{C.V.} = \frac{10}{145}(100) = 6.9\% $$
and for the 11-year-olds
$$ \text{C.V.} = \frac{10}{80}(100) = 12.5\% $$
If we compare these results, we get quite a different impression. It is clear
from this example that variation is much higher in the sample of 11-year-olds
than in the sample of 25-year-olds. ■
The coefficient of variation is also useful in comparing the results obtained by
different persons who are conducting investigations involving the same variable. Since the
coefficient of variation is independent of the scale of measurement, it is a useful statistic for
comparing the variability of two or more variables measured on different scales. We could,
for example, use the coefficient of variation to compare the variability in weights of one
sample of subjects whose weights are expressed in pounds with the variability in weights of
another sample of subjects whose weights are expressed in kilograms.
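A minimal Python sketch of Equation 2.5.5, applied to the figures of Example 2.5.3, is given below. The function name is illustrative, and the pound-to-kilogram conversion is added only to show that rescaling both the mean and the standard deviation leaves the coefficient of variation unchanged.

```python
def coefficient_of_variation(std_dev, mean):
    """Return the standard deviation as a percentage of the mean (Equation 2.5.5)."""
    return std_dev / mean * 100

# Example 2.5.3: both samples have a standard deviation of 10 pounds.
cv_25_year_olds = coefficient_of_variation(10, 145)
cv_11_year_olds = coefficient_of_variation(10, 80)

# Re-express the 25-year-olds' weights in kilograms (1 lb = 0.45359237 kg).
LB_TO_KG = 0.45359237
cv_25_year_olds_kg = coefficient_of_variation(10 * LB_TO_KG, 145 * LB_TO_KG)

print(round(cv_25_year_olds, 1), round(cv_11_year_olds, 1), round(cv_25_year_olds_kg, 1))
```

Because the unit conversion factor appears in both the numerator and the denominator, it cancels, which is why the measure can compare samples recorded on different scales.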
Computer Analysis Computer software packages provide a variety of possibilities
in the calculation of descriptive measures. Figure 2.5.2 shows a printout of the
descriptive measures available from the MINITAB package. The data consist of the
ages from Example 2.4.2.
In the printout Q1 and Q3 are the first and third quartiles, respectively. These
measures are described later in this chapter. N stands for the number of data observations,
and N* stands for the number of missing values. The term SE Mean stands for standard
error of the mean. This measure will be discussed in detail in a later chapter. Figure 2.5.3
shows, for the same data, the SAS® printout obtained by using the PROC MEANS
statement.
Percentiles and Quartiles The mean and median are special cases of a family
of parameters known as location parameters. These descriptive measures are called
location parameters because they can be used to designate certain positions on the
horizontal axis when the distribution of a variable is graphed. In that sense the so-called
location parameters “locate” the distribution on the horizontal axis. For example, a
distribution with a median of 100 is located to the right of a distribution with a median
of 50 when the two distributions are graphed. Other location parameters include percentiles
and quartiles. We may define a percentile as follows:
DEFINITION
Given a set of $n$ observations $x_1, x_2, \ldots, x_n$, the $p$th percentile $P$ is the
value of $X$ such that $p$ percent or less of the observations are less than $P$
and $(100 - p)$ percent or less of the observations are greater than $P$.
Variable    N   N*    Mean   SE Mean   StDev   Minimum      Q1   Median      Q3   Maximum
C1         10    0   56.00      3.00    9.49     38.00   48.25    58.00   64.25     66.00
FIGURE 2.5.2 Printout of descriptive measures computed from the sample of ages in
Example 2.4.2, MINITAB software package.
The MEANS Procedure
Analysis Variable: Age
 N          Mean       Std Dev       Minimum       Maximum
10    56.0000000     9.4868330    38.0000000    66.0000000
                                              Coeff of
Std Error           Sum      Variance     Variation
3.0000000   560.0000000    90.0000000    16.9407732
FIGURE 2.5.3 Printout of descriptive measures computed from the sample of ages in
Example 2.4.2, SAS® software package.
Subscripts on $P$ serve to distinguish one percentile from another. The 10th percentile,
for example, is designated $P_{10}$, the 70th is designated $P_{70}$, and so on. The 50th percentile is
the median and is designated $P_{50}$. The 25th percentile is often referred to as the first quartile
and denoted $Q_1$. The 50th percentile (the median) is referred to as the second or middle
quartile and written $Q_2$, and the 75th percentile is referred to as the third quartile, $Q_3$.
When we wish to find the quartiles for a set of data, the following formulas are used:
$$ \begin{aligned}
Q_1 &= \frac{n + 1}{4}\text{th ordered observation} \\
Q_2 &= \frac{2(n + 1)}{4} = \frac{n + 1}{2}\text{th ordered observation} \\
Q_3 &= \frac{3(n + 1)}{4}\text{th ordered observation}
\end{aligned} \qquad (2.5.6) $$
It should be noted that the equations shown in 2.5.6 determine the positions of the quartiles
in a data set, not the values of the quartiles. It should also be noted that though there is a
universal way to calculate the median ($Q_2$), there are a variety of ways to calculate $Q_1$ and
$Q_3$ values. For example, SAS provides for a total of five different ways to calculate the
quartile values, and other programs implement still other methods. For a discussion of
the various methods for calculating quartiles, interested readers are referred to the article
by Hyndman and Fan (3). To illustrate, note that the printout in MINITAB in Figure 2.5.2
shows $Q_1 = 48.25$ and $Q_3 = 64.25$, whereas program R yields the values $Q_1 = 52.75$ and
$Q_3 = 63.25$.
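The distinction between a quartile's position and its value can be sketched in Python. The helper below is a hypothetical illustration of Equation 2.5.6 with linear interpolation between neighboring ordered observations; it is one of several possible conventions, not the method of any particular package. The ages are assumed to be those of Example 2.4.2.

```python
import math

def quartile_by_position(values, p):
    """Value at ordered position p * (n + 1), interpolating between neighbors."""
    ordered = sorted(values)
    n = len(ordered)
    position = p * (n + 1)              # 1-based position from Equation 2.5.6
    lower = math.floor(position)
    fraction = position - lower
    if lower < 1:                       # position falls before the first observation
        return ordered[0]
    if lower >= n:                      # position falls at or beyond the last observation
        return ordered[-1]
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])

# Ages assumed to be the sample from Example 2.4.2 (not restated in this section).
ages = [43, 66, 61, 64, 65, 38, 59, 57, 57, 50]

q1 = quartile_by_position(ages, 0.25)   # position 2.75
q2 = quartile_by_position(ages, 0.50)   # position 5.5
q3 = quartile_by_position(ages, 0.75)   # position 8.25

print(q1, q2, q3)
```

Under this convention the results agree with the quartiles in the MINITAB printout of Figure 2.5.2; a different interpolation rule would generally give different values for $Q_1$ and $Q_3$ while leaving the median unchanged.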
Interquartile Range As we have seen, the range provides a crude measure of
the variability present in a set of data. A disadvantage of the range is the fact that it is
computed from only two values, the largest and the smallest. A similar measure that
reflects the variability among the middle 50 percent of the observations in a data set is
the interquartile range.
DEFINITION
The interquartile range (IQR) is the difference between the third and first
quartiles: that is,
$$ \text{IQR} = Q_3 - Q_1 \qquad (2.5.7) $$
A large IQR indicates a large amount of variability among the middle 50 percent of the
relevant observations, and a small IQR indicates a small amount of variability among the
relevant observations. Since such statements are rather vague, it is more informative to
compare the interquartile range with the range for the entire data set. A comparison may
be made by forming the ratio of the IQR to the range (R) and multiplying by 100. That is,
100 (IQR/R) tells us what percent the IQR is of the overall range.
Kurtosis Just as we may describe a distribution in terms of skewness, we may
describe a distribution in terms of kurtosis.
DEFINITION
Kurtosis is a measure of the degree to which a distribution is “peaked” or
flat in comparison to a normal distribution whose graph is characterized
by a bell-shaped appearance.
A distribution, in comparison to a normal distribution, may possess an excessive
proportion of observations in its tails, so that its graph exhibits a flattened appearance.
Such a distribution is said to be platykurtic. Conversely, a distribution, in comparison to a
normal distribution, may possess a smaller proportion of observations in its tails, so that its
graph exhibits a more peaked appearance. Such a distribution is said to be leptokurtic. A
normal, or bell-shaped distribution, is said to be mesokurtic.
Kurtosis can be expressed as
$$ \text{Kurtosis} = \frac{n \sum_{i=1}^{n} (x_i - \bar{x})^4}{\left[ \sum_{i=1}^{n} (x_i - \bar{x})^2 \right]^2} - 3 = \frac{n \sum_{i=1}^{n} (x_i - \bar{x})^4}{(n - 1)^2 s^4} - 3 \qquad (2.5.8) $$
Manual calculation using Equation 2.5.8 is usually not necessary, since most statistical
packages calculate and report information regarding kurtosis as part of the descriptive
statistics for a data set. Note that each of the two parts of Equation 2.5.8 has been reduced
by 3. Without this reduction, a perfectly mesokurtic distribution has a kurtosis measure of 3.
Most computer algorithms reduce the measure by 3, as is done in Equation 2.5.8, so that the
kurtosis measure of a mesokurtic distribution will be equal to 0. A leptokurtic distribution,
then, will have a kurtosis measure > 0, and a platykurtic distribution will have a kurtosis
measure < 0. Be aware that not all computer packages make this adjustment. In such cases,
comparisons with a mesokurtic distribution are made against 3 instead of against 0. Graphs
of distributions representing the three types of kurtosis are shown in Figure 2.5.4.
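For readers who want to see Equation 2.5.8 evaluated directly, the following Python sketch computes both forms of the expression and confirms that they agree. The function names and the five-value data set are hypothetical, chosen only to keep the arithmetic easy to follow. Statistical packages may apply additional small-sample corrections, so their reported kurtosis need not match this formula exactly.

```python
def kurtosis_from_sums(values):
    """First form of Equation 2.5.8: n * sum(d^4) / (sum(d^2))^2 - 3."""
    n = len(values)
    mean = sum(values) / n
    sum_sq = sum((x - mean) ** 2 for x in values)
    sum_fourth = sum((x - mean) ** 4 for x in values)
    return n * sum_fourth / sum_sq ** 2 - 3

def kurtosis_from_variance(values):
    """Second form of Equation 2.5.8: n * sum(d^4) / ((n - 1)^2 * s^4) - 3."""
    n = len(values)
    mean = sum(values) / n
    s2 = sum((x - mean) ** 2 for x in values) / (n - 1)   # sample variance
    sum_fourth = sum((x - mean) ** 4 for x in values)
    return n * sum_fourth / ((n - 1) ** 2 * s2 ** 2) - 3

# Hypothetical, evenly spread data: deviations are -2, -1, 0, 1, 2.
sample = [1, 2, 3, 4, 5]

k_sums = kurtosis_from_sums(sample)          # 5 * 34 / 10**2 - 3
k_variance = kurtosis_from_variance(sample)  # same quantity via s^2 = 2.5

print(round(k_sums, 4), round(k_variance, 4))
```

The negative result for this flat, evenly spread sample is in line with the sign convention described above for platykurtic distributions.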
EXAMPLE 2.5.4
Consider the three distributions shown in Figure 2.5.4. Given that the histograms represent
frequency counts, the data can be easily re-created and entered into a statistical package.
For example, observation of the “mesokurtic” distribution would yield the following data:
1, 2, 2, 3, 3, 3, 3, 3, . . . , 9, 9, 9, 9, 9, 10, 10, 11. Values can be obtained from the other
distributions in a similar fashion. Using SPSS software, the following descriptive statistics
were obtained for these three distributions:
            Mesokurtic    Leptokurtic    Platykurtic
Mean          6.0000        6.0000         6.0000
Median        6.0000        6.0000         6.0000
Mode          6.00          6.00           6.00
Kurtosis       .000          .608         -1.158
■
Box-and-Whisker Plots A useful visual device for communicating the information
contained in a data set is the box-and-whisker plot. The construction of a
box-and-whisker plot (sometimes called, simply, a boxplot) makes use of the quartiles of a data set
and may be accomplished by following these five steps:
1. Represent the variable of interest on the horizontal axis.
2. Draw a box in the space above the horizontal axis in such a way that the left end of the
box aligns with the first quartile $Q_1$ and the right end of the box aligns with the third
quartile $Q_3$.
3. Divide the box into two parts by a vertical line that aligns with the median $Q_2$.
4. Draw a horizontal line called a whisker from the left end of the box to a point that
aligns with the smallest measurement in the data set.
5. Draw another horizontal line, or whisker, from the right end of the box to a point that
aligns with the largest measurement in the data set.
Examination of a box-and-whisker plot for a set of data reveals information
regarding the amount of spread, location of concentration, and symmetry of the data.
The following example illustrates the construction of a box-and-whisker plot.
EXAMPLE 2.5.5
Evans et al. (A-7) examined the effect of velocity on ground reaction forces (GRF) in
dogs with lameness from a torn cranial cruciate ligament. The dogs were walked and
trotted over a force platform, and the GRF was recorded during a certain phase of their
performance. Table 2.5.1 contains 20 measurements of force where each value shown is
the mean of five force measurements per dog when trotting.
FIGURE 2.5.4 Three histograms representing kurtosis.
TABLE 2.5.1 GRF Measurements When Trotting of 20 Dogs with a Lame
Ligament
14.6 24.3 24.9 27.0 27.2 27.4 28.2 28.8 29.9 30.7
31.5 31.6 32.3 32.8 33.3 33.6 34.3 36.9 38.3 44.0
Source: Data provided courtesy of Richard Evans, Ph.D.
Solution: The smallest and largest measurements are 14.6 and 44, respectively. The
first quartile is the $Q_1 = (20 + 1)/4 = 5.25$th measurement, which is
$27.2 + (.25)(27.4 - 27.2) = 27.25$. The median is the $Q_2 = (20 + 1)/2 = 10.5$th
measurement, or $30.7 + (.5)(31.5 - 30.7) = 31.1$; and the third
quartile is the $Q_3 = 3(20 + 1)/4 = 15.75$th measurement, which is equal
to $33.3 + (.75)(33.6 - 33.3) = 33.525$. The interquartile range is
$\text{IQR} = 33.525 - 27.25 = 6.275$. The range is 29.4, and the IQR is
$100(6.275/29.4) = 21$ percent of the range. The resulting box-and-whisker
plot is shown in Figure 2.5.5. ■
Examination of Figure 2.5.5 reveals that 50 percent of the measurements are between
about 27 and 33, the approximate values of the first and third quartiles, respectively. The
vertical bar inside the box shows that the median is about 31.
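The five quantities needed to draw the plot in Example 2.5.5 can be obtained with Python's standard-library `statistics` module. The sketch below assumes that `statistics.quantiles` with `method="exclusive"` interpolates at the same $(n + 1)$-based positions as Equation 2.5.6; that correspondence is an assumption about the library rather than something stated in the text, so the results should be compared with the hand calculation above.

```python
import statistics

# GRF measurements from Table 2.5.1.
grf = [14.6, 24.3, 24.9, 27.0, 27.2, 27.4, 28.2, 28.8, 29.9, 30.7,
       31.5, 31.6, 32.3, 32.8, 33.3, 33.6, 34.3, 36.9, 38.3, 44.0]

# Quartiles at the (n + 1)-based positions, assuming method="exclusive" matches Equation 2.5.6.
q1, q2, q3 = statistics.quantiles(grf, n=4, method="exclusive")

smallest, largest = min(grf), max(grf)
iqr = q3 - q1                                  # Equation 2.5.7
data_range = largest - smallest
iqr_percent_of_range = 100 * iqr / data_range  # IQR as a percentage of the range

# Whisker ends, box edges, and median line of the box-and-whisker plot.
print(smallest, round(q1, 3), round(q2, 3), round(q3, 3), largest)
print(round(iqr, 3), round(data_range, 1), round(iqr_percent_of_range))
```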
Many statistical software packages have the capability of constructing box-and-
whisker plots. Figure 2.5.6 shows one constructed by MINITAB and one constructed by
NCSS from the data of Table 2.5.1. The procedure to produce the MINITAB plot is shown in
Figure 2.5.7. The asterisks in Figure 2.5.6 alert us to the fact that the data set contains one
unusually large and one unusually small value, called outliers. The outliers are the dogs
that generated forces of 14.6 and 44. Figure 2.5.6 illustrates the fact that box-and-whisker
plots may be displayed vertically as well as horizontally.
An outlier, or atypical observation, may be defined as follows.
FIGURE 2.5.5 Box-and-whisker plot for Example 2.5.5 (horizontal axis: GRF Measurements).
FIGURE 2.5.6 Box-and-whisker plot constructed by MINITAB (left) and by R (right) from the
data of Table 2.5.1 (vertical axis: Force).
DEFINITION
An outlier is an observation whose value, $x$, either exceeds the value
of the third quartile by a magnitude greater than 1.5(IQR) or is less than
the value of the first quartile by a magnitude greater than 1.5(IQR).
That is, an observation of $x > Q_3 + 1.5(\text{IQR})$ or an observation of
$x < Q_1 - 1.5(\text{IQR})$ is called an outlier.
For the data in Table 2.5.1 we may use the previously computed values of $Q_1$, $Q_3$,
and IQR to determine how large or how small a value would have to be in order to be
considered an outlier. The calculations are as follows:
$$ x < 27.25 - 1.5(6.275) = 17.8375 \quad \text{and} \quad x > 33.525 + 1.5(6.275) = 42.9375 $$
For the data in Table 2.5.1, then, an observed value smaller than 17.8375 or larger than
42.9375 would be considered an outlier.
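The outlier rule can be applied to the data of Table 2.5.1 with a few lines of Python. This sketch takes the quartiles computed in Example 2.5.5 as given; the variable names are illustrative.

```python
# GRF measurements from Table 2.5.1.
grf = [14.6, 24.3, 24.9, 27.0, 27.2, 27.4, 28.2, 28.8, 29.9, 30.7,
       31.5, 31.6, 32.3, 32.8, 33.3, 33.6, 34.3, 36.9, 38.3, 44.0]

# Quartiles taken from the solution to Example 2.5.5.
q1, q3 = 27.25, 33.525
iqr = q3 - q1

lower_cutoff = q1 - 1.5 * iqr   # values below this are outliers
upper_cutoff = q3 + 1.5 * iqr   # values above this are outliers

outliers = [x for x in grf if x < lower_cutoff or x > upper_cutoff]

print(round(lower_cutoff, 4), round(upper_cutoff, 4), outliers)
```

The two values flagged are the same two observations that appear as asterisks in Figure 2.5.6.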
The SAS® statement PROC UNIVARIATE may be used to obtain a box-and-whisker
plot. The statement also produces other descriptive measures and displays, including
stem-and-leaf plots, means, variances, and quartiles.
Exploratory Data Analysis Box-and-whisker plots and stem-and-leaf displays
are examples of what are known as exploratory data analysis techniques. These
techniques, made popular as a result of the work of Tukey (4), allow the investigator to examine
data in ways that reveal trends and relationships, identify unique features of data sets, and
facilitate their description and summarization.
EXERCISES
For each of the data sets in the following exercises compute (a) the mean, (b) the median, (c) the
mode, (d) the range, (e) the variance, (f) the standard deviation, (g) the coefficient of variation, and (h)
the interquartile range. Treat each data set as a sample. For those exercises for which you think it
would be appropriate, construct a box-and-whisker plot and discuss the usefulness in understanding
the nature of the data that this device provides. For each exercise select the measure of central
tendency that you think would be most appropriate for describing the data. Give reasons to justify
your choice.
Dialog box:                          Session command:
Stat > EDA > Boxplot > Simple        MTB > Boxplot ‘Force’;
Click OK.                            SUBC> IQRbox;
                                     SUBC> Outlier.
Type Force in Graph Variables.
Click OK.
FIGURE 2.5.7 MINITAB procedure to produce Figure 2.5.6.
2.5.1 Porcellini et al. (A-8) studied 13 HIV-positive patients who were treated with highly active
antiretroviral therapy (HAART) for at least 6 months. The CD4 T cell counts ($\times 10^6$/L) at baseline
for the 13 subjects are listed below.
230 205 313 207 227 245 173
58 103 181 105 301 169
Source: Simona Porcellini, Guiliana Vallanti, Silvia Nozza,
Guido Poli, Adriano Lazzarin, Guiseppe Tambussi,
Antonio Grassia, “Improved Thymopoietic Potential in
Aviremic HIV Infected Individuals with HAART by
Intermittent IL-2 Administration,” AIDS, 17 (2003),
1621–1630.
2.5.2 Shair and Jasper (A-9) investigated whether decreasing the venous return in young rats would affect
ultrasonic vocalizations (USVs). Their research showed no significant change in the number of
ultrasonic vocalizations when blood was removed from either the superior vena cava or the carotid
artery. Another important variable measured was the heart rate (bpm) during the withdrawal of blood.
The table below presents the heart rate of seven rat pups from the experiment involving the carotid
artery.
500 570 560 570 450 560 570
Source: Harry N. Shair and Anna Jasper, “Decreased
Venous Return Is Neither Sufficient nor Necessary to Elicit
Ultrasonic Vocalization of Infant Rat Pups,” Behavioral
Neuroscience, 117 (2003), 840–853.
2.5.3 Butz et al. (A-10) evaluated the duration of benefit derived from the use of noninvasive positive-
pressure ventilation by patients with amyotrophic lateral sclerosis on symptoms, quality of life, and
survival. One of the variables of interest is partial pressure of arterial carbon dioxide (PaCO$_2$). The
values below (mm Hg) reflect the result of baseline testing on 30 subjects as established by arterial
blood gas analyses.
40.0 47.0 34.0 42.0 54.0 48.0 53.6 56.9 58.0 45.0
54.5 54.0 43.0 44.3 53.9 41.8 33.0 43.1 52.4 37.9
34.5 40.1 33.0 59.9 62.6 54.1 45.7 40.6 56.6 59.0
Source: M. Butz, K. H. Wollinsky, U. Widemuth-Catrinescu, A. Sperfeld,
S. Winter, H. H. Mehrkens, A. C. Ludolph, and H. Schreiber, “Longitudinal Effects
of Noninvasive Positive-Pressure Ventilation in Patients with Amyotrophic Lateral
Sclerosis,” American Journal of Medical Rehabilitation, 82 (2003), 597–604.
2.5.4 According to Starch et al. (A-11), hamstring tendon grafts have been the “weak link” in anterior
cruciate ligament reconstruction. In a controlled laboratory study, they compared two techniques for
reconstruction: either an interference screw or a central sleeve and screw on the tibial side. For eight
cadaveric knees, the measurements below represent the required force (in newtons) at which initial
failure of graft strands occurred for the central sleeve and screw technique.
172.5 216.63 212.62 98.97 66.95 239.76 19.57 195.72
Source: David W. Starch, Jerry W. Alexander, Philip C. Noble, Suraj Reddy, and David M.
Lintner, “Multistranded Hamstring Tendon Graft Fixation with a Central Four-Quadrant or
a Standard Tibial Interference Screw for Anterior Cruciate Ligament Reconstruction,” The
American Journal of Sports Medicine, 31 (2003), 338–344.
2.5.5 Cardosi et al. (A-12) performed a 4-year retrospective review of 102 women undergoing radical
hysterectomy for cervical or endometrial cancer. Catheter-associated urinary tract infection was
observed in 12 of the subjects. Below are the numbers of postoperative days until diagnosis of the
infection for each subject experiencing an infection.
16 10 49 15 6 15
8 19 11 22 13 17
Source: Richard J. Cardosi, Rosemary Cardosi, Edward
C. Grendys Jr., James V. Fiorica, and Mitchel S. Hoffman,
“Infectious Urinary Tract Morbidity with Prolonged
Bladder Catheterization After Radical Hysterectomy,” American
Journal of Obstetrics and Gynecology,
189 (2003), 380–384.
2.5.6 The purpose of a study by Nozawa et al. (A-13) was to evaluate the outcome of surgical repair of pars
interarticularis defect by segmental wire fixation in young adults with lumbar spondylolysis. The
authors found that segmental wire fixation historically has been successful in the treatment of
nonathletes with spondylolysis, but no information existed on the results of this type of surgery in
athletes. In a retrospective study, the authors found 20 subjects who had the surgery between 1993 and
2000. For these subjects, the data below represent the duration in months of follow-up care after the
operation.
103 68 62 60 60 54 49 44 42 41
38 36 34 30 19 19 19 19 17 16
Source: Satoshi Nozawa, Katsuji Shimizu, Kei Miyamoto, and
Mizuo Tanaka, “Repair of Pars Interarticularis Defect
by Segmental Wire Fixation in Young Athletes with
Spondylolysis,” American Journal of Sports Medicine, 31 (2003),
359–364.
2.5.7 See Exercise 2.3.1.
2.5.8 See Exercise 2.3.2.
2.5.9 See Exercise 2.3.3.
2.5.10 See Exercise 2.3.4.
2.5.11 See Exercise 2.3.5.
2.5.12 See Exercise 2.3.6.
2.5.13 See Exercise 2.3.7.
2.5.14 In a pilot study, Huizinga et al. (A-14) wanted to gain more insight into the psychosocial
consequences for children of a parent with cancer. For the study, 14 families participated in
semistructured interviews and completed standardized questionnaires. Below is the age of the
sick parent with cancer (in years) for the 14 families.
37 48 53 46 42 49 44
38 32 32 51 51 48 41
Source: Gea A. Huizinga, Winette T.A. van der Graaf, Annemike
Visser, Jos S. Dijkstra, and Josette E. H. M. Hoekstra-Weebers, “Psychosocial
Consequences for Children of a Parent with Cancer,” Cancer Nursing, 26
(2003), 195–202.
2.6 SUMMARY
In this chapter various descriptive statistical procedures are explained. These include the
organization of data by means of the ordered array, the frequency distribution, the relative
frequency distribution, the histogram, and the frequency polygon. The concepts of
central tendency and variation are described, along with methods for computing their
more common measures: the mean, median, mode, range, variance, and standard
deviation. The reader is also introduced to the concepts of skewness and kurtosis,
and to exploratory data analysis through a description of stem-and-leaf displays and box-
and-whisker plots.
We emphasize the use of the computer as a tool for calculating descriptive measures
and constructing various distributions from large data sets.
SUMMARY OF FORMULAS FOR CHAPTER 2
Formula
Number    Name                      Formula
2.3.1     Class interval width      $w = \frac{R}{k}$
          using Sturges’s Rule
2.4.1     Mean of a population      $\mu = \frac{\sum_{i=1}^{N} x_i}{N}$
2.4.2     Skewness                  $\text{Skewness} = \frac{\sqrt{n} \sum_{i=1}^{n} (x_i - \bar{x})^3}{\left[ \sum_{i=1}^{n} (x_i - \bar{x})^2 \right]^{3/2}} = \frac{\sqrt{n} \sum_{i=1}^{n} (x_i - \bar{x})^3}{(n - 1) \sqrt{n - 1}\, s^3}$
2.4.2     Mean of a sample          $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
2.5.1     Range                     $R = x_L - x_S$
2.5.2     Sample variance           $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
2.5.3     Population variance       $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
2.5.4     Standard deviation        $s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$
2.5.5     Coefficient of variation  $\text{C.V.} = \frac{s}{\bar{x}}(100)\%$
2.5.6     Quartile location in      $Q_1 = \frac{1}{4}(n + 1)$
          ordered array             $Q_2 = \frac{1}{2}(n + 1)$
                                    $Q_3 = \frac{3}{4}(n + 1)$
2.5.7     Interquartile range       $\text{IQR} = Q_3 - Q_1$
2.5.8     Kurtosis                  $\text{Kurtosis} = \frac{n \sum_{i=1}^{n} (x_i - \bar{x})^4}{\left[ \sum_{i=1}^{n} (x_i - \bar{x})^2 \right]^2} - 3 = \frac{n \sum_{i=1}^{n} (x_i - \bar{x})^4}{(n - 1)^2 s^4} - 3$
Symbol Key
• C.V. = coefficient of variation
• IQR = interquartile range
• $k$ = number of class intervals
• $\mu$ = population mean
• $N$ = population size
• $n$ = sample size
• $(n - 1)$ = degrees of freedom
• $Q_1$ = first quartile
• $Q_2$ = second quartile = median
• $Q_3$ = third quartile
• $R$ = range
• $s$ = standard deviation
• $s^2$ = sample variance
• $\sigma^2$ = population variance
• $x_i$ = $i$th data observation
• $x_L$ = largest data point
• $x_S$ = smallest data point
• $\bar{x}$ = sample mean
• $w$ = class width
REVIEW QUESTIONS AND EXERCISES
1. Define:
(a) Stem-and-leaf display (b) Box-and-whisker plot
(c) Percentile (d) Quartile
(e) Location parameter (f) Exploratory data analysis
(g) Ordered array (h) Frequency distribution
(i) Relative frequency distribution (j) Statistic
(k) Parameter (l) Frequency polygon
(m) True class limits (n) Histogram
2. Define and compare the characteristics of the mean, the median, and the mode.
3. What are the advantages and limitations of the range as a measure of dispersion?
4. Explain the rationale for using $n - 1$ to compute the sample variance.
5. What is the purpose of the coefficient of variation?
6. What is the purpose of Sturges’s rule?
7. What is another name for the 50th percentile (second or middle quartile)?
8. Describe from your field of study a population of data where knowledge of the central tendency and
dispersion would be useful. Obtain real or realistic synthetic values from this population and compute
the mean, median, mode, variance, and standard deviation.
9. Collect a set of real, or realistic, data from your field of study and construct a frequency distribution, a
relative frequency distribution, a histogram, and a frequency polygon.
10. Compute the mean, median, mode, variance, and standard deviation for the data in Exercise 9.
11. Find an article in a journal from your field of study in which some measures of central tendency and
dispersion have been computed.
12. The purpose of a study by Tam et al. (A-15) was to investigate the wheelchair maneuvering in
individuals with lower-level spinal cord injury (SCI) and healthy controls. Subjects used a modified
wheelchair to incorporate a rigid seat surface to facilitate the specified experimental measurements.
Interface pressure measurement was recorded by using a high-resolution pressure-sensitive mat with
a spatial resolution of 4 sensors per square centimeter taped on the rigid seat support. During static
sitting conditions, average pressures were recorded under the ischial tuberosities. The data for
measurements of the left ischial tuberosity (in mm Hg) for the SCI and control groups are shown
below.
Control 131 115 124 131 122 117 88 114 150 169
SCI 60 150 130 180 163 130 121 119 130 148
Source: Eric W. Tam, Arthur F. Mak, Wai Nga Lam, John H. Evans, and York Y.
Chow, “Pelvic Movement and Interface Pressure Distribution During Manual Wheelchair
Propulsion,” Archives of Physical Medicine and Rehabilitation, 84 (2003),
1466–1472.
(a) Find the mean, median, variance, and standard deviation for the controls.
(b) Find the mean, median, variance, and standard deviation for the SCI group.
(c) Construct a box-and-whisker plot for the controls.
(d) Construct a box-and-whisker plot for the SCI group.
(e) Do you believe there is a difference in pressure readings for controls and SCI subjects in this
study?
13. Johnson et al. (A-16) performed a retrospective review of 50 fetuses that underwent open fetal
myelomeningocele closure. The data below show the gestational age in weeks of the 50 fetuses
undergoing the procedure.
25 25 26 27 29 29 29 30 30 31
32 32 32 33 33 33 33 34 34 34
35 35 35 35 35 35 35 35 35 36
36 36 36 36 36 36 36 36 36 36
36 36 36 36 36 36 36 36 37 37
Source: Mark P. Johnson, Leslie N. Sutton, Natalie Rintoul, Timothy M. Crombleholme,
Alan W. Flake, Lori J. Howell, Holly L. Hedrick, R. Douglas Wilson, and
N. Scott Adzick, “Fetal Myelomeningocele Repair: Short-Term Clinical Outcomes,”
American Journal of Obstetrics and Gynecology, 189 (2003), 482–487.
(a) Construct a stem-and-leaf plot for these gestational ages.
(b) Based on the stem-and-leaf plot, what one word would you use to describe the nature of the data?
(c) Why do you think the stem-and-leaf plot looks the way it does?
(d) Compute the mean, median, variance, and standard deviation.
14. The following table gives the age distribution for the number of deaths in New York State due to
accidents for residents age 25 and older.
Age (Years)    Number of Deaths Due to Accidents
25–34          393
35–44          514
45–54          460
55–64          341
65–74          365
75–84          616
85–94+         618
Source: New York State Department of Health, Vital
Statistics of New York State, 2000, Table 32: Death
Summary Information by Age.
+ May include deaths due to accident for adults over age 94.
For these data construct a cumulative frequency distribution, a relative frequency distribution, and a
cumulative relative frequency distribution.
15. Krieser et al. (A-17) examined glomerular filtration rate (GFR) in pediatric renal transplant
recipients. GFR is an important parameter of renal function assessed in renal transplant recipients.
The following are measurements from 19 subjects of GFR measured with diethylenetriamine
pentaacetic acid. (Note: some subjects were measured more than once.)
18 42
21 43
21 43
23 48
27 48
27 51
30 55
32 58
32 60
32 62
36 67
37 68
41 88
42 63
Source: Data provided courtesy of D. M. Z. Krieser, M.D.
(a) Compute mean, median, variance, standard deviation, and coefficient of variation.
(b) Construct a stem-and-leaf display.
(c) Construct a box-and-whisker plot.
(d) What percentage of the measurements is within one standard deviation of the mean? Two
standard deviations? Three standard deviations?
16. The following are the cystatin C levels (mg/L) for the patients described in Exercise 15 (A-17).
Cystatin C is a cationic basic protein that was investigated for its relationship to GFR levels. In
addition, creatinine levels are also given. (Note: Some subjects were measured more than once.)
Cystatin C (mg/L) Creatinine (mmol/L)
1.78 4.69 0.35 0.14
2.16 3.78 0.30 0.11
1.82 2.24 0.20 0.09
1.86 4.93 0.17 0.12
1.75 2.71 0.15 0.07
1.83 1.76 0.13 0.12
2.49 2.62 0.14 0.11
1.69 2.61 0.12 0.07
1.85 3.65 0.24 0.10
1.76 2.36 0.16 0.13
1.25 3.25 0.17 0.09
1.50 2.01 0.11 0.12
2.06 2.51 0.12 0.06
2.34
Source: Data provided courtesy of D. M. Z. Krieser, M.D.
(a) For each variable, compute the mean, median, variance, standard deviation, and coefficient of
variation.
(b) For each variable, construct a stem-and-leaf display and a box-and-whisker plot.
(c) Which set of measurements is more variable, cystatin C or creatinine? On what do you base your
answer?
17. Give three synonyms for variation (variability).
18. The following table shows the age distribution of live births in Albany County, New York, for
2000.
Mother’s Age Number of Live Births
10–14 7
15–19 258
20–24 585
25–29 841
30–34 981
35–39 526
40–44 99
45–49+ 4
Source: New York State Department of Health, Annual
Vital Statistics 2000, Table 7, Live Births by Resident
County and Mother’s Age.
+ May include live births to mothers over age 49.
For these data construct a cumulative frequency distribution, a relative frequency distribution, and a
cumulative relative frequency distribution.
19. Spivack et al. (A-18) investigated the severity of disease associated with C. difficile in pediatric inpatients.
One of the variables they examined was number of days patients experienced diarrhea. The data for
the 22 subjects in the study appear below. Compute the mean, median, variance, and standard
deviation.
3 11 3 4 14 2 4 5 3 11 2
2 3 2 1 1 7 2 1 1 3 2
Source: Jordan G. Spivack, Stephen C. Eppes, and Joel D. Klien,
“Clostridium Difficile–Associated Diarrhea in a Pediatric
Hospital,” Clinical Pediatrics, 42 (2003), 347–352.
20. Express in words the following properties of the sample mean:
(a) $\sum (x - \bar{x})^2$ = a minimum
(b) $n\bar{x} = \sum x$
(c) $\sum (x - \bar{x}) = 0$
21. Your statistics instructor tells you on the first day of class that there will be five tests during the term.
From the scores on these tests for each student, the instructor will compute a measure of central
tendency that will serve as the student’s final course grade. Before taking the first test, you must
choose whether you want your final grade to be the mean or the median of the five test scores. Which
would you choose? Why?
22. Consider the following possible class intervals for use in constructing a frequency distribution of
serum cholesterol levels of subjects who participated in a mass screening:
(a)        (b)        (c)
50–74      50–74      50–75
75–99      75–99      75–100
100–149    100–124    100–125
150–174    125–149    125–150
175–199    150–174    150–175
200–249    175–199    175–200
250–274    200–224    200–225
etc.       225–249    225–250
           etc.       etc.
Which set of class intervals do you think is most appropriate for the purpose? Why? State specifically
for each one why you think the other two are less desirable.
23. On a statistics test students were asked to construct a frequency distribution of the blood creatine
levels (units/liter) for a sample of 300 healthy subjects. The mean was 95, and the standard deviation
was 40. The following class interval widths were used by the students:
(a) 1 (d) 15
(b) 5 (e) 20
(c) 10 (f) 25
Comment on the appropriateness of these choices of widths.
24. Give a health sciences-related example of a population of measurements for which the mean would
be a better measure of central tendency than the median.
25. Give a health sciences-related example of a population of measurements for which the median would
be a better measure of central tendency than the mean.
26. Indicate for the following variables which you think would be a better measure of central tendency,
the mean, the median, or mode, and justify your choice:
(a) Annual incomes of licensed practical nurses in the Southeast.
(b) Diagnoses of patients seen in the emergency department of a large city hospital.
(c) Weights of high-school male basketball players.
27. Refer to Exercise 2.3.11. Compute the mean, median, variance, standard deviation, first quartile, third
quartile, and interquartile range. Construct a boxplot of the data. Are the mode, median, and mean
equal? If not, explain why. Discuss the data in terms of variability. Compare the IQR with the range.
What does the comparison tell you about the variability of the observations?
28. Refer to Exercise 2.3.12. Compute the mean, median, variance, standard deviation, first quartile, third
quartile, and interquartile range. Construct a boxplot of the data. Are the mode, median, and mean
equal? If not, explain why. Discuss the data in terms of variability. Compare the IQR with the range.
What does the comparison tell you about the variability of the observations?
29. Thilothammal et al. (A-19) designed a study to determine the efficacy of BCG (bacillus
Calmette-Guerin) vaccine in preventing tuberculous meningitis. Among the data collected on
each subject was a measure of nutritional status (actual weight expressed as a percentage of
expected weight for actual height). The following table shows the nutritional status values of the
107 cases studied.
73.3 54.6 82.4 76.5 72.2 73.6 74.0
80.5 71.0 56.8 80.6 100.0 79.6 67.3
50.4 66.0 83.0 72.3 55.7 64.1 66.3
50.9 71.0 76.5 99.6 79.3 76.9 96.0
64.8 74.0 72.6 80.7 109.0 68.6 73.8
74.0 72.7 65.9 73.3 84.4 73.2 70.0
72.8 73.6 70.0 77.4 76.4 66.3 50.5
72.0 97.5 130.0 68.1 86.4 70.0 73.0
59.7 89.6 76.9 74.6 67.7 91.9 55.0
90.9 70.5 88.2 70.5 74.0 55.5 80.0
76.9 78.1 63.4 58.8 92.3 100.0 84.0
71.4 84.6 123.7 93.7 76.9 79.6
45.6 92.5 65.6 61.3 64.5 72.7
77.5 76.9 80.2 76.9 88.7 78.1
60.6 59.0 84.7 78.2 72.4 68.3
67.5 76.9 82.6 85.4 65.7 65.9
Source: Data provided courtesy of Dr. N. Thilothammal.
(a) For these data compute the following descriptive measures: mean, median, mode, variance,
standard deviation, range, first quartile, third quartile, and IQR.
(b) Construct the following graphs for the data: histogram, frequency polygon, stem-and-leaf plot,
and boxplot.
(c) Discuss the data in terms of variability. Compare the IQR with the range. What does the
comparison tell you about the variability of the observations?
(d) What proportion of the measurements are within one standard deviation of the mean? Two
standard deviations of the mean? Three standard deviations of the mean?
(e) What proportion of the measurements are less than 100?
(f) What proportion of the measurements are less than 50?
Exercises for Use with Large Data Sets Available on the Following Website: www.wiley.com/college/daniel
1. Refer to the dataset NCBIRTH800. The North Carolina State Center for Health Statistics and
Howard W. Odum Institute for Research in Social Science at the University of North Carolina at
Chapel Hill (A-20) make publicly available birth and infant death data for all children born in the
state of North Carolina. These data can be accessed at www.irss.unc.edu/ncvital/bfd1down.html.
Records on birth data go back to 1968. This comprehensive data set for the births in 2001 contains
120,300 records. The data set NCBIRTH800 represents a random sample of 800 of those births and selected variables.
The variables are as follows:
Variable Label   Description
PLURALITY        Number of children born of the pregnancy
SEX              Sex of child (1 = male; 2 = female)
MAGE             Age of mother (years)
WEEKS            Completed weeks of gestation (weeks)
MARITAL          Marital status (1 = married; 2 = not married)
RACEMOM          Race of mother (0 = other non-White; 1 = White; 2 = Black; 3 = American
                 Indian; 4 = Chinese; 5 = Japanese; 6 = Hawaiian; 7 = Filipino; 8 = Other
                 Asian or Pacific Islander)
HISPMOM          Mother of Hispanic origin (C = Cuban; M = Mexican; N = Non-Hispanic;
                 O = other and unknown Hispanic; P = Puerto Rican; S = Central/South
                 American; U = not classifiable)
GAINED           Weight gained during pregnancy (pounds)
SMOKE            0 = mother did not smoke during pregnancy
                 1 = mother did smoke during pregnancy
DRINK            0 = mother did not consume alcohol during pregnancy
                 1 = mother did consume alcohol during pregnancy
TOUNCES          Weight of child (ounces)
TGRAMS           Weight of child (grams)
LOW              0 = infant was not low birth weight
                 1 = infant was low birth weight
PREMIE           0 = infant was not premature
                 1 = infant was premature
                 (premature defined as 36 weeks or sooner)
For the variables of MAGE, WEEKS, GAINED, TOUNCES, and TGRAMS:
1. Calculate the mean, median, standard deviation, IQR, and range.
2. For each, construct a histogram and comment on the shape of the distribution.
3. Do the histograms for TOUNCES and TGRAMS look strikingly similar? Why?
4. Construct box-and-whisker plots for all five variables.
5. Construct side-by-side box-and-whisker plots for the variable of TOUNCES for women who
admitted to smoking and women who did not admit to smoking. Do you see a difference in birth
weight in the two groups? Which group has more variability?
6. Construct side-by-side box-and-whisker plots for the variable of MAGE for women who are and are
not married. Do you see a difference in ages in the two groups? Which group has more variability?
Are the results surprising?
7. Calculate the skewness and kurtosis of the data set. What do they indicate?
REFERENCES
Methodology References
1. H. A. STURGES, “The Choice of a Class Interval,” Journal of the American Statistical Association, 21 (1926),
65–66.
2. HELEN M. WALKER, “Degrees of Freedom,” Journal of Educational Psychology, 31 (1940), 253–269.
3. ROB J. HYNDMAN and YANAN FAN, “Sample Quantiles in Statistical Packages,” The American Statistician, 50
(1996), 361–365.
4. JOHN W. TUKEY, Exploratory Data Analysis, Addison-Wesley, Reading, MA, 1977.
Applications References
A-1. FARHAD ATASSI, “Oral Home Care and the Reasons for Seeking Dental Care by Individuals on Renal Dialysis,”
Journal of Contemporary Dental Practice, 3 (2002), 031–041.
A-2. VALLABH JANARDHAN, ROBERT FRIEDLANDER, HOWARD RIINA, and PHILIP EDWIN STIEG, “Identifying Patients at Risk for
Postprocedural Morbidity after Treatment of Incidental Intracranial Aneurysms: The Role of Aneurysm Size and
Location,” Neurosurgical Focus, 13 (2002), 1–8.
A-3. A. HOEKEMA, B. HOVINGA, B. STEGENGA, and L. G. M. De BONT, “Craniofacial Morphology and Obstructive Sleep
Apnoea: A Cephalometric Analysis,” Journal of Oral Rehabilitation, 30 (2003), 690–696.
A-4. DAVID H. HOLBEN, “Selenium Content of Venison, Squirrel, and Beef Purchased or Produced in Ohio, a Low
Selenium Region of the United States,” Journal of Food Science, 67 (2002), 431–433.
A-5. ERIK SKJELBO, THEONEST K. MUTABINGWA, IB BYGBJERG, KARIN K. NIELSEN, LARS F. GRAM, and KIM BRØSEN,
“Chloroguanide Metabolism in Relation to the Efficacy in Malaria Prophylaxis and the S-Mephenytoin Oxidation
in Tanzanians,” Clinical Pharmacology & Therapeutics, 59 (1996), 304–311.
A-6. HENRIK SCHMIDT, POUL ERIK MORTENSEN, SØREN LARS FØLSGAARD, and ESTHER A. JENSEN, “Autotransfusion after
Coronary Artery Bypass Grafting Halves the Number of Patients Needing Blood Transfusion,” Annals of Thoracic
Surgery, 61 (1996), 1178–1181.
A-7. RICHARD EVANS, WANDA GORDON, and MIKE CONZEMIUS, “Effect of Velocity on Ground Reaction Forces in Dogs
with Lameness Attributable to Tearing of the Cranial Cruciate Ligament,” American Journal of Veterinary
Research, 64 (2003), 1479–1481.
A-8. SIMONA PORCELLINI, GUILIANA VALLANTI, SILVIA NOZZA, GUIDO POLI, ADRIANO LAZZARIN, GUISEPPE TAMBUSSI, and
ANTONIO GRASSIA, “Improved Thymopoietic Potential in Aviremic HIV Infected Individuals with HAART by
Intermittent IL-2 Administration,” AIDS, 17 (2003), 1621–1630.
A-9. HARRY N. SHAIR and ANNA JASPER, “Decreased Venous Return is Neither Sufficient nor Necessary to Elicit
Ultrasonic Vocalization of Infant Rat Pups,” Behavioral Neuroscience, 117 (2003), 840–853.
A-10. M. BUTZ, K. H. WOLLINSKY, U. WIDEMUTH-CATRINESCU, A. SPERFELD, S. WINTER, H. H. MEHRKENS, A. C.
LUDOLPH, and H. SCHREIBER, “Longitudinal Effects of Noninvasive Positive-Pressure Ventilation in
Patients with Amyotrophic Lateral Sclerosis,” American Journal of Medical Rehabilitation, 82 (2003),
597–604.
A-11. DAVID W. STARCH, JERRY W. ALEXANDER, PHILIP C. NOBLE, SURAJ REDDY, and DAVID M. LINTNER, “Multistranded
Hamstring Tendon Graft Fixation with a Central Four-Quadrant or a Standard Tibial Interference Screw for
Anterior Cruciate Ligament Reconstruction,” American Journal of Sports Medicine, 31 (2003), 338–344.
A-12. RICHARD J. CARDOSI, ROSEMARY CARDOSI, EDWARD C. GRENDYS Jr., JAMES V. FIORICA, and MITCHEL S. HOFFMAN,
“Infectious Urinary Tract Morbidity with Prolonged Bladder Catheterization after Radical Hysterectomy,”
American Journal of Obstetrics and Gynecology, 189 (2003), 380–384.
A-13. SATOSHI NOZAWA, KATSUJI SHIMIZU, KEI MIYAMOTO, and MIZUO TANAKA, “Repair of Pars Interarticularis Defect by
Segmental Wire Fixation in Young Athletes with Spondylolysis,” American Journal of Sports Medicine, 31
(2003), 359–364.
A-14. GEA A. HUIZINGA, WINETTE T. A. van der GRAAF, ANNEMIKE VISSER, JOS S. DIJKSTRA, and JOSETTE E. H. M.
HOEKSTRA-WEEBERS, “Psychosocial Consequences for Children of a Parent with Cancer,” Cancer Nursing, 26
(2003), 195–202.
A-15. ERIC W. TAM, ARTHUR F. MAK, WAI NGA LAM, JOHN H. EVANS, and YORK Y. CHOW, “Pelvic Movement and Interface
Pressure Distribution During Manual Wheelchair Propulsion,” Archives of Physical Medicine and Rehabilita-
tion, 84 (2003), 1466–1472.
A-16. MARK P. JOHNSON, LESLIE N. SUTTON, NATALIE RINTOUL, TIMOTHY M. CROMBLEHOLME, ALAN W. FLAKE, LORI
J. HOWELL, HOLLY L. HEDRICK, R. DOUGLAS WILSON, and N. SCOTT ADZICK, “Fetal Myelomeningocele
Repair: Short-term Clinical Outcomes,” American Journal of Obstetrics and Gynecology, 189 (2003),
482–487.
A-17. D. M. Z. KRIESER, A. R. ROSENBERG, G. KAINER, and D. NAIDOO, “The Relationship between Serum Creatinine,
Serum Cystatin C, and Glomerular Filtration Rate in Pediatric Renal Transplant Recipients: A Pilot Study,”
Pediatric Transplantation, 6 (2002), 392–395.
A-18. JORDAN G. SPIVACK, STEPHEN C. EPPES, and JOEL D. KLIEN, “Clostridium Difficile–Associated Diarrhea in a
Pediatric Hospital,” Clinical Pediatrics, 42 (2003), 347–352.
A-19. N. THILOTHAMMAL, P. V. KRISHNAMURTHY, DESMOND K. RUNYAN, and K. BANU, “Does BCG Vaccine Prevent
Tuberculous Meningitis?” Archives of Disease in Childhood, 74 (1996), 144–147.
A-20. North Carolina State Center for Health Statistics and Howard W. Odum Institute for Research in Social Science
at the University of North Carolina at Chapel Hill. Birth data set for 2001 found at www.irss.unc.edu/ncvital/
bfd1down.html. All calculations were performed by John Holcomb and do not represent the findings of the
Center or Institute.
CHAPTER 3
SOME BASIC PROBABILITY
CONCEPTS
CHAPTER OVERVIEW
Probability lays the foundation for statistical inference. This chapter provides a
brief overview of the probability concepts necessary for understanding topics
covered in the chapters that follow. It also provides a context for understanding
the probability distributions used in statistical inference, and introduces
the student to several measures commonly found in the medical
literature (e.g., the sensitivity and specificity of a test).
TOPICS
3.1 INTRODUCTION
3.2 TWO VIEWS OF PROBABILITY: OBJECTIVE AND SUBJECTIVE
3.3 ELEMENTARY PROPERTIES OF PROBABILITY
3.4 CALCULATING THE PROBABILITY OF AN EVENT
3.5 BAYES’ THEOREM, SCREENING TESTS, SENSITIVITY, SPECIFICITY,
AND PREDICTIVE VALUE POSITIVE AND NEGATIVE
3.6 SUMMARY
LEARNING OUTCOMES
After studying this chapter, the student will
1. understand classical, relative frequency, and subjective probability.
2. understand the properties of probability and selected probability rules.
3. be able to calculate the probability of an event.
4. be able to apply Bayes’ theorem when calculating screening test results.
3.1 INTRODUCTION
The theory of probability provides the foundation for statistical inference. However, this
theory, which is a branch of mathematics, is not the main concern of this book, and,
consequently, only its fundamental concepts are discussed here. Students who desire to
pursue this subject should refer to the many books on probability available in most college
and university libraries. The books by Gut (1), Isaac (2), and Larson (3) are recommended.
The objectives of this chapter are to help students gain some mathematical ability in the
area of probability and to assist them in developing an understanding of the more important
concepts. Progress along these lines will contribute immensely to their success in understanding
the statistical inference procedures presented later in this book.
The concept of probability is not foreign to health workers and is frequently
encountered in everyday communication. For example, we may hear a physician say
that a patient has a 50–50 chance of surviving a certain operation. Another physician may
say that she is 95 percent certain that a patient has a particular disease. A public health
nurse may say that nine times out of ten a certain client will break an appointment. As these
examples suggest, most people express probabilities in terms of percentages. In dealing
with probabilities mathematically, it is more convenient to express probabilities as
fractions. (Percentages result from multiplying the fractions by 100.) Thus, we measure
the probability of the occurrence of some event by a number between zero and one. The
more likely the event, the closer the number is to one; and the more unlikely the event, the
closer the number is to zero. An event that cannot occur has a probability of zero, and an
event that is certain to occur has a probability of one.
Health sciences researchers continually ask themselves if the results of their efforts
could have occurred by chance alone or if some other force was operating to produce the
observed effects. For example, suppose six out of ten patients suffering from some disease
are cured after receiving a certain treatment. Is such a cure rate likely to have occurred if
the patients had not received the treatment, or is it evidence of a true curative effect on the
part of the treatment? We shall see that questions such as these can be answered through the
application of the concepts and laws of probability.
3.2 TWO VIEWS OF PROBABILITY:
OBJECTIVE AND SUBJECTIVE
Until fairly recently, probability was thought of by statisticians and mathematicians only as
an objective phenomenon derived from objective processes.
The concept of objective probability may be categorized further under the headings
of (1) classical, or a priori, probability, and (2) the relative frequency, or a posteriori,
concept of probability.
Classical Probability The classical treatment of probability dates back to the
17th century and the work of two mathematicians, Pascal and Fermat. Much of this theory
developed out of attempts to solve problems related to games of chance, such as those
involving the rolling of dice. Examples from games of chance illustrate very well the
principles involved in classical probability. For example, if a fair six-sided die is rolled, the
probability that a 1 will be observed is equal to 1/6 and is the same for the other five faces.
If a card is picked at random from a well-shuffled deck of ordinary playing cards, the
probability of picking a heart is 13/52. Probabilities such as these are calculated by the
processes of abstract reasoning. It is not necessary to roll a die or draw a card to compute
these probabilities. In the rolling of the die, we say that each of the six sides is equally likely
to be observed if there is no reason to favor any one of the six sides. Similarly, if there is no
reason to favor the drawing of a particular card from a deck of cards, we say that each of the
52 cards is equally likely to be drawn. We may define probability in the classical sense
as follows:
DEFINITION
If an event can occur in N mutually exclusive and equally likely ways,
and if m of these possess a trait E, the probability of the occurrence of E
is equal to m/N.
If we read $P(E)$ as “the probability of E,” we may express this definition as
$$P(E) = \frac{m}{N} \tag{3.2.1}$$
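The classical definition can be checked with a short Python sketch. The function name `classical_probability` is our own choice for illustration, and exact fractions are used only so that the results match the die and card examples given above.

```python
from fractions import Fraction

def classical_probability(m, N):
    """Return P(E) = m/N for m favorable ways out of N equally likely ways."""
    if N <= 0 or not 0 <= m <= N:
        raise ValueError("need N > 0 and 0 <= m <= N")
    return Fraction(m, N)

# A fair six-sided die: one face shows a 1, and there are six faces in all.
p_one = classical_probability(1, 6)

# An ordinary 52-card deck: 13 of the cards are hearts.
p_heart = classical_probability(13, 52)

print(p_one, p_heart)  # Fraction reduces 13/52 to 1/4
```

No die is rolled and no card is drawn; both values follow from counting alone, which is the point of the classical view.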
Relative Frequency Probability The relative frequency approach to prob-
ability depends on the repeatability of some process and the ability to count the number
of repetitions, as well as the number of times that some event of interest occurs. In this
context we may define the probability of observing some characteristic, E, of an event
as follows:
DEFINITION
If some process is repeated a large number of times, n, and if some
resulting event with the characteristic E occurs m times, the relative
frequency of occurrence of E, m/n, will be approximately equal to the
probability of E.
To express this definition in compact form, we write
$$P(E) = \frac{m}{n} \tag{3.2.2}$$
We must keep in mind, however, that, strictly speaking, m/n is only an estimate of $P(E)$.
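The sense in which m/n only estimates the probability can be seen in a small simulation. The sketch below assumes a simulated fair die in place of a real repeated process; the function name `relative_frequency` and the fixed seed are our own choices, made so that the run is repeatable.

```python
import random

def relative_frequency(n, seed=0):
    """Roll a simulated fair die n times and return m/n, where m counts the 1s."""
    rng = random.Random(seed)
    m = sum(1 for _ in range(n) if rng.randint(1, 6) == 1)
    return m / n

# The estimate m/n tends to settle near 1/6 as the number of repetitions grows.
for n in (10, 100, 10_000, 100_000):
    print(n, round(relative_frequency(n), 4))
```

For small n the relative frequency may lie far from 1/6; it is only with many repetitions that it becomes a dependable estimate.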
Subjective Probability In the early 1950s, L. J. Savage (4) gave considerable
impetus to what is called the “personalistic” or subjective concept of probability. This view
holds that probability measures the confidence that a particular individual has in the truth of
a particular proposition. This concept does not rely on the repeatability of any process. In
fact, by applying this concept of probability, one may evaluate the probability of an event
that can only happen once, for example, the probability that a cure for cancer will be
discovered within the next 10 years.
Although the subjective view of probability has enjoyed increased attention over the
years, it has not been fully accepted by statisticians who have traditional orientations.
Bayesian Methods Bayesian methods are named in honor of the Reverend
Thomas Bayes (1702–1761), an English clergyman who had an interest in mathematics.
Bayesian methods are an example of subjective probability, since they take into consideration
the degree of belief that one has in the chance that an event will occur. While
probabilities based on classical or relative frequency concepts are designed to allow for
decisions to be made solely on the basis of collected data, Bayesian methods make use of
what are known as prior probabilities and posterior probabilities.
DEFINITION
The prior probability of an event is a probability based on prior
knowledge, prior experience, or results derived from prior
data collection activity.
DEFINITION
The posterior probability of an event is a probability obtained by using
new information to update or revise a prior probability.
As more data are gathered, more is likely to be known about the “true” probability of the
event under consideration. Although the idea of updating probabilities based on new
information is in direct contrast to the philosophy behind frequency-of-occurrence probability,
Bayesian concepts are widely used. For example, Bayesian techniques have found
recent application in the construction of e-mail spam filters. Typically, the application of
Bayesian concepts makes use of a mathematical formula called Bayes’ theorem. In Section
3.5 we employ Bayes’ theorem in the evaluation of diagnostic screening test data.
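To make the idea of revising a prior probability concrete, the following Python sketch applies Bayes' theorem in its standard form to a spam-filter setting. All three input values are hypothetical numbers chosen only for illustration, and the function name `posterior` is our own.

```python
from fractions import Fraction

def posterior(prior, p_evidence_if_true, p_evidence_if_false):
    """Revise a prior probability in light of new evidence (Bayes' theorem)."""
    numerator = p_evidence_if_true * prior
    denominator = numerator + p_evidence_if_false * (1 - prior)
    return numerator / denominator

# Hypothetical inputs, not taken from any data set.
prior_spam = Fraction(1, 5)       # prior belief: 20% of messages are spam
p_word_if_spam = Fraction(1, 2)   # a certain word appears in 50% of spam
p_word_if_not = Fraction(1, 20)   # and in 5% of legitimate messages

posterior_spam = posterior(prior_spam, p_word_if_spam, p_word_if_not)
print(posterior_spam, round(float(posterior_spam), 4))
```

Under these assumed values, seeing the word raises the probability of spam from 1/5 to 5/7. The posterior could in turn serve as the prior when further evidence arrives.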
3.3 ELEMENTARY PROPERTIES
OF PROBABILITY
In 1933 the axiomatic approach to probability was formalized by the Russian mathemati-
cian A. N. Kolmogorov (5). The basis of this approach is embodied in three properties from
which a whole system of probability theory is constructed through the use of mathematical
logic. The three properties are as follows.
1. Given some process (or experiment) with n mutually exclusive outcomes (called
events), $E_1, E_2, \ldots, E_n$, the probability of any event $E_i$ is assigned a nonnegative
number. That is,
$$P(E_i) \ge 0 \tag{3.3.1}$$
In other words, all events must have a probability greater than or equal to zero,
a reasonable requirement in view of the difficulty of conceiving of negative probability.
A key concept in the statement of this property is the concept of mutually
exclusive outcomes. Two events are said to be mutually exclusive if they cannot occur
simultaneously.
2. The sum of the probabilities of the mutually exclusive outcomes is equal to 1.
$$P(E_1) + P(E_2) + \cdots + P(E_n) = 1 \tag{3.3.2}$$
This is the property of exhaustiveness and refers to the fact that the observer of
a probabilistic process must allow for all possible events, and when all are taken
together, their total probability is 1. The requirement that the events be mutually
exclusive is specifying that the events $E_1, E_2, \ldots, E_n$ do not overlap; that is, no two of
them can occur at the same time.
3. Consider any two mutually exclusive events, $E_i$ and $E_j$. The probability of the
occurrence of either $E_i$ or $E_j$ is equal to the sum of their individual probabilities.
$$P(E_i + E_j) = P(E_i) + P(E_j) \tag{3.3.3}$$
Suppose the two events were not mutually exclusive; that is, suppose they could
occur at the same time. In attempting to compute the probability of the occurrence of either
$E_i$ or $E_j$ the problem of overlapping would be discovered, and the procedure could become
quite complicated. This concept will be discussed further in the next section.
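The three properties can be verified for a simple case. The Python sketch below assumes the six faces of a fair die as the mutually exclusive outcomes; the variable names are our own, and the check is an illustration rather than a proof.

```python
from fractions import Fraction

# Assumed example: the six faces of a fair die, each with probability 1/6.
probabilities = {face: Fraction(1, 6) for face in range(1, 7)}

# Property 1 (3.3.1): every probability is nonnegative.
nonnegative = all(p >= 0 for p in probabilities.values())

# Property 2 (3.3.2): the probabilities of all the outcomes sum to 1.
exhaustive = sum(probabilities.values()) == 1

# Property 3 (3.3.3): for the mutually exclusive faces 1 and 2, the probability
# of "1 or 2" found by counting equals the sum of the individual probabilities.
p_by_counting = Fraction(len({1, 2}), 6)
p_by_adding = probabilities[1] + probabilities[2]
additive = p_by_counting == p_by_adding

print(nonnegative, exhaustive, additive, p_by_adding)
```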
3.4 CALCULATING THE PROBABILITY
OF AN EVENT
We now make use of the concepts and techniques of the previous sections in calculating the
probabilities of specific events. Additional ideas will be introduced as needed.
EXAMPLE 3.4.1
The primary aim of a study by Carter et al. (A-1) was to investigate the effect of the age at
onset of bipolar disorder on the course of the illness. One of the variables investigated was
family history of mood disorders. Table 3.4.1 shows the frequency of a family history of
mood disorders in the two groups of interest (Early age at onset defined to be 18 years or
younger and Later age at onset defined to be later than 18 years). Suppose we pick a person
at random from this sample. What is the probability that this person will be 18 years old
or younger?
TABLE 3.4.1 Frequency of Family History of Mood Disorder by
Age Group among Bipolar Subjects
Family History of Mood Disorders   Early ≤ 18 (E)   Later > 18 (L)   Total
Negative (A)                             28               35            63
Bipolar disorder (B)                     19               38            57
Unipolar (C)                             41               44            85
Unipolar and bipolar (D)                 53               60           113
Total                                   141              177           318
Source: Tasha D. Carter, Emanuela Mundo, Sagar V. Parkh, and James L. Kennedy,
“Early Age at Onset as a Risk Factor for Poor Outcome of Bipolar Disorder,” Journal of
Psychiatric Research, 37 (2003), 297–303.
Solution: For purposes of illustrating the calculation of probabilities we consider this
group of 318 subjects to be the largest group for which we have an interest. In
other words, for this example, we consider the 318 subjects as a population.
We assume that Early and Later are mutually exclusive categories and that the
likelihood of selecting any one person is equal to the likelihood of selecting
any other person. We define the desired probability as the number of subjects
with the characteristic of interest (Early) divided by the total number of
subjects. We may write the result in probability notation as follows:
$$P(E) = \frac{\text{number of Early subjects}}{\text{total number of subjects}} = \frac{141}{318} = .4434$$
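The same calculation can be carried out from the cell counts of Table 3.4.1. In the Python sketch below, the dictionary layout and the variable names are our own; only the counts come from the table.

```python
from fractions import Fraction

# Counts from Table 3.4.1: rows are family history, columns are age at onset.
table = {
    "A": {"E": 28, "L": 35},  # negative family history
    "B": {"E": 19, "L": 38},  # bipolar disorder
    "C": {"E": 41, "L": 44},  # unipolar
    "D": {"E": 53, "L": 60},  # unipolar and bipolar
}

# Marginal total for the Early column and the grand total.
n_early = sum(row["E"] for row in table.values())
n_total = sum(sum(row.values()) for row in table.values())

# Unconditional (marginal) probability of Early age at onset.
p_early = Fraction(n_early, n_total)
print(n_early, n_total, round(float(p_early), 4))
```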
Conditional Probability On occasion, the set of “all possible outcomes” may
constitute a subset of the total group. In other words, the size of the group of interest may be
reduced by conditions not applicable to the total group. When probabilities are calculated
with a subset of the total group as the denominator, the result is a conditional probability.
The probability computed in Example 3.4.1, for example, may be thought of as an
unconditional probability, since the size of the total group served as the denominator. No
conditions were imposed to restrict the size of the denominator. We may also think of this
probability as a marginal probability since one of the marginal totals was used as the
numerator.
We may illustrate the concept of conditional probability by referring again to
Table 3.4.1.
EXAMPLE 3.4.2
Suppose we pick a subject at random from the 318 subjects and find that he is 18 years or
younger (E). What is the probability that this subject will be one who has no family history
of mood disorders (A)?
Solution: The total number of subjects is no longer of interest, since, with the selection
of an Early subject, the Later subjects are eliminated. We may define the
desired probability, then, as follows: What is the probability that a subject has
no family history of mood disorders (A), given that the selected subject is
Early (E)? This is a conditional probability and is written as $P(A \mid E)$ in which
the vertical line is read “given.” The 141 Early subjects become the
denominator of this conditional probability, and 28, the number of Early
subjects with no family history of mood disorders, becomes the numerator.
Our desired probability, then, is
$$P(A \mid E) = 28/141 = .1986$$
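In code, conditioning on E amounts to restricting attention to the Early column before dividing. The following Python sketch uses the Early counts from Table 3.4.1; the variable names are our own.

```python
from fractions import Fraction

# Early (E) column of Table 3.4.1, by family history category.
early_counts = {"A": 28, "B": 19, "C": 41, "D": 53}

# The Early subjects alone form the reduced group, so their total
# replaces 318 as the denominator.
n_early = sum(early_counts.values())

p_a_given_e = Fraction(early_counts["A"], n_early)
print(n_early, round(float(p_a_given_e), 4))
```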
Joint Probability Sometimes we want to find the probability that a subject picked
at random from a group of subjects possesses two characteristics at the same time. Such a
probability is referred to as a joint probability. We illustrate the calculation of a joint
probability with the following example.
EXAMPLE 3.4.3
Let us refer again to Table 3.4.1. What is the probability that a person picked at random
from the 318 subjects will be Early (E) and will be a person who has no family history of
mood disorders (A)?
Solution: The probability we are seeking may be written in symbolic notation as
$P(E \cap A)$ in which the symbol $\cap$ is read either as “intersection” or “and.” The
statement $E \cap A$ indicates the joint occurrence of conditions E and A. The
number of subjects satisfying both of the desired conditions is found in
Table 3.4.1 at the intersection of the column labeled E and the row labeled A
and is seen to be 28. Since the selection will be made from the total set of
subjects, the denominator is 318. Thus, we may write the joint probability as
$$P(E \cap A) = 28/318 = .0881$$
The Multiplication Rule A probability may be computed from other probabili-
ties. For example, a joint probability may be computed as the product of an appropriate
marginal probability and an appropriate conditional probability. This relationship is known
as the multiplication rule of probability. We illustrate with the following example.
EXAMPLE 3.4.4
We wish to compute the joint probability of Early age at onset (E) and a negative family
history of mood disorders (A) from a knowledge of an appropriate marginal probability and
an appropriate conditional probability.
Solution: The probability we seek is $P(E \cap A)$. We have already computed a marginal
probability, $P(E) = 141/318 = .4434$, and a conditional probability,
$P(A \mid E) = 28/141 = .1986$. It so happens that these are appropriate marginal
and conditional probabilities for computing the desired joint probability. We
may now compute $P(E \cap A) = P(E)P(A \mid E) = (.4434)(.1986) = .0881$.
This, we note, is, as expected, the same result we obtained earlier for $P(E \cap A)$.
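The agreement between the two routes to the joint probability can be confirmed exactly. The Python sketch below uses the counts from Table 3.4.1 and exact fractions, so no rounding enters the comparison; the variable names are our own.

```python
from fractions import Fraction

n_total = 318        # all subjects in Table 3.4.1
n_early = 141        # Early (E) column total
n_early_and_a = 28   # Early subjects with a negative family history (A)

p_e = Fraction(n_early, n_total)                # marginal probability P(E)
p_a_given_e = Fraction(n_early_and_a, n_early)  # conditional probability P(A | E)

# Multiplication rule: joint = marginal times conditional.
p_joint_by_rule = p_e * p_a_given_e

# Direct calculation from the cell count and the grand total.
p_joint_direct = Fraction(n_early_and_a, n_total)

print(p_joint_by_rule == p_joint_direct, round(float(p_joint_by_rule), 4))
```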
We may state the multiplication rule in general terms as follows: For any two events
A and B,
$$P(A \cap B) = P(B)P(A \mid B), \quad \text{if } P(B) \ne 0 \tag{3.4.1}$$
For the same two events A and B, the multiplication rule may also be written as
$P(A \cap B) = P(A)P(B \mid A)$, if $P(A) \ne 0$.
We see that through algebraic manipulation the multiplication rule as stated in
Equation 3.4.1 may be used to find any one of the three probabilities in its statement if the
other two are known. We may, for example, find the conditional probability $P(A \mid B)$ by
dividing $P(A \cap B)$ by $P(B)$. This relationship allows us to formally define conditional
probability as follows.
DEFINITION
The conditional probability of A given B is equal to the probability of
$A \cap B$ divided by the probability of B, provided the probability of B
is not zero.
That is,
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \quad P(B) \ne 0 \tag{3.4.2}$$
We illustrate the use of the multiplication rule to compute a conditional probability with the
following example.
EXAMPLE 3.4.5
We wish to use Equation 3.4.2 and the data in Table 3.4.1 to find the conditional probability,
$P(A \mid E)$.
Solution: According to Equation 3.4.2,
$$P(A \mid E) = P(A \cap E)/P(E)$$
Earlier we found $P(E \cap A) = P(A \cap E) = 28/318 = .0881$. We have also determined that
$P(E) = 141/318 = .4434$. Using these results we are able to compute
$P(A \mid E) = .0881/.4434 = .1987$, which, as expected, is the same result we obtained by using the
frequencies directly from Table 3.4.1. (The slight discrepancy is due to rounding.)
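The arithmetic of Examples 3.4.4 and 3.4.5 can be retraced in a few lines of Python. This is only an illustrative sketch: the variable names are our own, and the counts are the ones quoted in the text from Table 3.4.1. Because the fractions are carried unrounded, the conditional probability comes back as .1986 rather than .1987, which is consistent with the remark above about rounding.

```python
# Counts quoted in the text from Table 3.4.1 (variable names are illustrative).
n_total = 318          # all subjects
n_early = 141          # Early age at onset (E)
n_early_and_neg = 28   # Early AND negative family history (E and A)

p_e = n_early / n_total                   # marginal probability P(E)
p_a_given_e = n_early_and_neg / n_early   # conditional probability P(A | E)

# Multiplication rule (Equation 3.4.1): P(E and A) = P(E) * P(A | E)
p_e_and_a = p_e * p_a_given_e

# Conditional probability (Equation 3.4.2): P(A | E) = P(A and E) / P(E)
p_a_given_e_check = p_e_and_a / p_e

print(round(p_e, 4), round(p_a_given_e, 4), round(p_e_and_a, 4))
# 0.4434 0.1986 0.0881
```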
The Addition Rule The third property of probability given previously states that
the probability of the occurrence of either one or the other of two mutually exclusive events
is equal to the sum of their individual probabilities. Suppose, for example, that we pick a
person at random from the 318 represented in Table 3.4.1. What is the probability that this
person will be Early age at onset $(E)$ or Later age at onset $(L)$? We state this probability
in symbols as $P(E \cup L)$, where the symbol $\cup$ is read either as “union” or “or.” Since the
two age conditions are mutually exclusive, $P(E \cup L) = (141/318) + (177/318) = .4434 + .5566 = 1$.
What if two events are not mutually exclusive? This case is covered by what is known
as the addition rule, which may be stated as follows:
DEFINITION
Given two events A and B, the probability that event A, or event B, or
both occur is equal to the probability that event A occurs, plus the
probability that event B occurs, minus the probability that the events
occur simultaneously.
The addition rule may be written
$$P(A \cup B) = P(A) + P(B) - P(A \cap B) \tag{3.4.3}$$
When events A and B cannot occur simultaneously, $P(A \cup B)$ is sometimes called
“exclusive or,” and $P(A \cap B) = 0$. When events A and B can occur simultaneously,
$P(A \cup B)$ is sometimes called “inclusive or,” and we use the addition rule to calculate
$P(A \cup B)$. Let us illustrate the use of the addition rule by means of an example.
EXAMPLE 3.4.6
If we select a person at random from the 318 subjects represented in Table 3.4.1, what is the
probability that this person will be an Early age of onset subject (E) or will have no family
history of mood disorders (A) or both?
Solution: The probability we seek is $P(E \cup A)$. By the addition rule as expressed
by Equation 3.4.3, this probability may be written as
$P(E \cup A) = P(E) + P(A) - P(E \cap A)$. We have already found that
$P(E) = 141/318 = .4434$ and $P(E \cap A) = 28/318 = .0881$. From the information in Table 3.4.1
we calculate $P(A) = 63/318 = .1981$. Substituting these results into the
equation for $P(E \cup A)$ we have $P(E \cup A) = .4434 + .1981 - .0881 = .5534$.
Note that the 28 subjects who are both Early and have no family history of mood disorders
are included in the 141 who are Early as well as in the 63 who have no family history of
mood disorders. Since, in computing the probability, these 28 have been added into the
numerator twice, they have to be subtracted out once to overcome the effect of duplication,
or overlapping.
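The subtraction of the overlap can be checked numerically. The following is a minimal sketch using the counts quoted in Example 3.4.6; the helper name `p_union` is our own. Note that with unrounded fractions the result rounds to .5535, whereas the text's .5534 comes from adding probabilities that had already been rounded to four places.

```python
# Counts quoted in Example 3.4.6 (from Table 3.4.1).
n_total = 318
n_e = 141        # Early age at onset (E)
n_a = 63         # negative family history (A)
n_e_and_a = 28   # both E and A

def p_union(p_a, p_b, p_a_and_b):
    """Addition rule (Equation 3.4.3): P(A or B) = P(A) + P(B) - P(A and B)."""
    return p_a + p_b - p_a_and_b

p_e_or_a = p_union(n_e / n_total, n_a / n_total, n_e_and_a / n_total)

# Same quantity obtained by counting each subject exactly once.
p_direct = (n_e + n_a - n_e_and_a) / n_total

print(round(p_e_or_a, 4))  # 0.5535
```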
Independent Events Suppose that, in Equation 3.4.2, we are told that event B has
occurred, but that this fact has no effect on the probability of A. That is, suppose that the
probability of event A is the same regardless of whether or not B occurs. In this situation,
$P(A \mid B) = P(A)$. In such cases we say that A and B are independent events. The
multiplication rule for two independent events, then, may be written as
$$P(A \cap B) = P(A)P(B); \quad P(A) \neq 0, \; P(B) \neq 0 \tag{3.4.4}$$
Thus, we see that if two events are independent, the probability of their joint
occurrence is equal to the product of the probabilities of their individual occurrences.
Note that when two events with nonzero probabilities are independent, each of the
following statements is true:
$$P(A \mid B) = P(A); \quad P(B \mid A) = P(B); \quad P(A \cap B) = P(A)P(B)$$
Two events are not independent unless all these statements are true. It is important to be
aware that the terms independent and mutually exclusive do not mean the same thing.
Let us illustrate the concept of independence by means of the following example.
EXAMPLE 3.4.7
In a certain high school class, consisting of 60 girls and 40 boys, it is observed that 24 girls
and 16 boys wear eyeglasses. If a student is picked at random from this class, the
probability that the student wears eyeglasses, $P(E)$, is $40/100$, or .4.
(a) What is the probability that a student picked at random wears eyeglasses, given that
the student is a boy?
Solution: By using the formula for computing a conditional probability, we find this
to be
$$P(E \mid B) = \frac{P(E \cap B)}{P(B)} = \frac{16/100}{40/100} = .4$$
Thus the additional information that a student is a boy does not alter the
probability that the student wears eyeglasses, and $P(E) = P(E \mid B)$. We say
that the events being a boy and wearing eyeglasses for this group are
independent. We may also show that the event of wearing eyeglasses, $E$,
and not being a boy, $\bar{B}$, are also independent as follows:
$$P(E \mid \bar{B}) = \frac{P(E \cap \bar{B})}{P(\bar{B})} = \frac{24/100}{60/100} = \frac{24}{60} = .4$$
(b) What is the probability of the joint occurrence of the events of wearing eyeglasses
and being a boy?
Solution: Using the rule given in Equation 3.4.1, we have
$$P(E \cap B) = P(B)P(E \mid B)$$
but, since we have shown that events E and B are independent we may replace
$P(E \mid B)$ by $P(E)$ to obtain, by Equation 3.4.4,
$$P(E \cap B) = P(B)P(E) = \left(\frac{40}{100}\right)\left(\frac{40}{100}\right) = .16$$
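The independence check of Example 3.4.7 can be reproduced exactly with rational arithmetic. This is a small sketch, assuming only the class counts given in the example; the variable names are ours, and Python's `fractions.Fraction` is used so that the equality tests are not disturbed by floating-point rounding.

```python
from fractions import Fraction

# Counts from Example 3.4.7.
girls, boys = 60, 40
girls_glasses, boys_glasses = 24, 16
n = girls + boys

p_e = Fraction(girls_glasses + boys_glasses, n)  # P(E): wears eyeglasses
p_b = Fraction(boys, n)                          # P(B): is a boy
p_e_and_b = Fraction(boys_glasses, n)            # P(E and B)

# Conditional probability (Equation 3.4.2).
p_e_given_b = p_e_and_b / p_b

# Independence holds when P(E and B) = P(E) P(B) (Equation 3.4.4),
# or equivalently when P(E | B) = P(E).
is_independent = (p_e_and_b == p_e * p_b)

print(p_e_given_b, p_e, is_independent)  # 2/5 2/5 True
```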
Complementary Events Earlier, using the data in Table 3.4.1, we computed the
probability that a person picked at random from the 318 subjects will be an Early age of
onset subject as $P(E) = 141/318 = .4434$. We found the probability of a Later age at onset
to be $P(L) = 177/318 = .5566$. The sum of these two probabilities we found to be equal
to 1. This is true because the events being Early age at onset and being Later age at onset are
complementary events. In general, we may make the following statement about complementary
events. The probability of an event A is equal to 1 minus the probability of its
complement, which is written $\bar{A}$, and
$$P(\bar{A}) = 1 - P(A) \tag{3.4.5}$$
This follows from the third property of probability since the event, A, and its
complement, $\bar{A}$, are mutually exclusive.
EXAMPLE 3.4.8
Suppose that of 1200 admissions to a general hospital during a certain period of time, 750
are private admissions. If we designate these as set A, then $\bar{A}$ is equal to 1200 minus 750, or
450. We may compute
$$P(A) = 750/1200 = .625$$
and
$$P(\bar{A}) = 450/1200 = .375$$
and see that
$$P(\bar{A}) = 1 - P(A)$$
$$.375 = 1 - .625$$
$$.375 = .375$$
Marginal Probability Earlier we used the term marginal probability to refer
to a probability in which the numerator of the probability is a marginal total from a table
such as Table 3.4.1. For example, when we compute the probability that a person picked
at random from the 318 persons represented in Table 3.4.1 is an Early age of onset
subject, the numerator of the probability is the total number of Early subjects, 141. Thus,
$P(E) = 141/318 = .4434$. We may define marginal probability more generally as follows:
DEFINITION
Given some variable that can be broken down into m categories
designated by $A_1, A_2, \ldots, A_i, \ldots, A_m$ and another jointly occurring
variable that is broken down into n categories designated by
$B_1, B_2, \ldots, B_j, \ldots, B_n$, the marginal probability of $A_i$, $P(A_i)$, is equal to the
sum of the joint probabilities of $A_i$ with all the categories of B. That is,
$$P(A_i) = \sum P(A_i \cap B_j), \quad \text{for all values of } j \tag{3.4.6}$$
The following example illustrates the use of Equation 3.4.6 in the calculation of a marginal
probability.
EXAMPLE 3.4.9
We wish to use Equation 3.4.6 and the data in Table 3.4.1 to compute the marginal
probability P(E).
Solution: The variable age at onset is broken down into two categories, Early for onset
18 years or younger (E) and Later for onset occurring at an age over 18 years
(L). The variable family history of mood disorders is broken down into four
categories: negative family history (A), bipolar disorder only (B), unipolar
disorder only (C), and subjects with a history of both unipolar and bipolar
disorder (D). The category Early occurs jointly with all four categories of the
variable family history of mood disorders. The four joint probabilities that
may be computed are
$$P(E \cap A) = 28/318 = .0881$$
$$P(E \cap B) = 19/318 = .0597$$
$$P(E \cap C) = 41/318 = .1289$$
$$P(E \cap D) = 53/318 = .1667$$
We obtain the marginal probability $P(E)$ by adding these four joint probabilities as follows:
$$P(E) = P(E \cap A) + P(E \cap B) + P(E \cap C) + P(E \cap D) = .0881 + .0597 + .1289 + .1667 = .4434$$
The result, as expected, is the same as the one obtained by using the marginal total for
Early as the numerator and the total number of subjects as the denominator.
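Equation 3.4.6 amounts to summing one row (or column) of joint probabilities. The sketch below retraces Example 3.4.9 under the assumption that the four Early counts are those quoted in the example; the dictionary layout and names are illustrative choices of ours, not part of the text.

```python
# Joint counts for Early age at onset (E) with each family-history category,
# as quoted in Example 3.4.9 (from Table 3.4.1).
n_total = 318
early_counts = {"A": 28, "B": 19, "C": 41, "D": 53}

# Joint probabilities P(E and A), P(E and B), P(E and C), P(E and D).
joint = {category: count / n_total for category, count in early_counts.items()}

# Marginal probability (Equation 3.4.6): sum the joint probabilities over all j.
p_e = sum(joint.values())

print(round(p_e, 4))  # 0.4434
```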
EXERCISES
3.4.1 In a study of violent victimization of women and men, Porcerelli et al. (A-2) collected information
from 679 women and 345 men aged 18 to 64 years at several family practice centers in the
metropolitan Detroit area. Patients filled out a health history questionnaire that included a question
about victimization. The following table shows the sample subjects cross-classified by sex and the
type of violent victimization reported. The victimization categories are defined as no victimization,
partner victimization (and not by others), victimization by persons other than partners (friends,
family members, or strangers), and those who reported multiple victimization.
| Sex | No Victimization | Partners | Nonpartners | Multiple Victimization | Total |
|---|---|---|---|---|---|
| Women | 611 | 34 | 16 | 18 | 679 |
| Men | 308 | 10 | 17 | 10 | 345 |
| Total | 919 | 44 | 33 | 28 | 1024 |
Source: Data provided courtesy of John H. Porcerelli, Ph.D., Rosemary Cogan, Ph.D.
(a) Suppose we pick a subject at random from this group. What is the probability that this subject
will be a woman?
(b) What do we call the probability calculated in part a?
(c) Show how to calculate the probability asked for in part a by two additional methods.
(d) If we pick a subject at random, what is the probability that the subject will be a woman and have
experienced partner abuse?
(e) What do we call the probability calculated in part d?
(f) Suppose we picked a man at random. Knowing this information, what is the probability that he
experienced abuse from nonpartners?
(g) What do we call the probability calculated in part f?
(h) Suppose we pick a subject at random. What is the probability that it is a man or someone who
experienced abuse from a partner?
(i) What do we call the method by which you obtained the probability in part h?
3.4.2 Fernando et al. (A-3) studied drug-sharing among injection drug users in the South Bronx in New
York City. Drug users in New York City use the term “split a bag” or “get down on a bag” to refer to
the practice of dividing a bag of heroin or other injectable substances. A common practice includes
splitting drugs after they are dissolved in a common cooker, a procedure with considerable HIV risk.
Although this practice is common, little is known about the prevalence of such practices. The
researchers asked injection drug users in four neighborhoods in the South Bronx if they ever
“got down on” drugs in bags or shots. The results classified by gender and splitting practice are
given below:
Gender Split Drugs Never Split Drugs Total
Male 349 324 673
Female 220 128 348
Total 569 452 1021
Source: Daniel Fernando, Robert F. Schilling, Jorge Fontdevila,
and Nabila El-Bassel, “Predictors of Sharing Drugs among
Injection Drug Users in the South Bronx: Implications for HIV
Transmission,” Journal of Psychoactive Drugs, 35 (2003), 227–236.
(a) How many marginal probabilities can be calculated from these data? State each in probability
notation and do the calculations.
(b) How many joint probabilities can be calculated? State each in probability notation and do the
calculations.
(c) How many conditional probabilities can be calculated? State each in probability notation and do
the calculations.
(d) Use the multiplication rule to find the probability that a person picked at random never split
drugs and is female.
(e) What do we call the probability calculated in part d?
(f) Use the multiplication rule to find the probability that a person picked at random is male, given
that he admits to splitting drugs.
(g) What do we call the probability calculated in part f?
3.4.3 Refer to the data in Exercise 3.4.2. State the following probabilities in words and calculate:
(a) $P(\text{Male} \cap \text{Split Drugs})$
(b) $P(\text{Male} \cup \text{Split Drugs})$
(c) $P(\text{Male} \mid \text{Split Drugs})$
(d) P(Male)
3.4.4 Laveist and Nuru-Jeter (A-4) conducted a study to determine if doctor–patient race concordance was
associated with greater satisfaction with care. Toward that end, they collected a national sample of
African-American, Caucasian, Hispanic, and Asian-American respondents. The following table
classifies the race of the subjects as well as the race of their physician:
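Patient’s Race (columns) by Physician’s Race (rows)

| Physician’s Race | Caucasian | African-American | Hispanic | Asian-American | Total |
|---|---|---|---|---|---|
| White | 779 | 436 | 406 | 175 | 1796 |
| African-American | 14 | 162 | 15 | 5 | 196 |
| Hispanic | 19 | 17 | 128 | 2 | 166 |
| Asian/Pacific-Islander | 68 | 75 | 71 | 203 | 417 |
| Other | 30 | 55 | 56 | 4 | 145 |
| Total | 910 | 745 | 676 | 389 | 2720 |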
Source: Thomas A. Laveist and Amani Nuru-Jeter, “Is Doctor–Patient Race Concordance Associated with Greater
Satisfaction with Care?” Journal of Health and Social Behavior, 43 (2002), 296–306.
(a) What is the probability that a randomly selected subject will have an Asian/Pacific-Islander
physician?
(b) What is the probability that an African-American subject will have an African-American
physician?
(c) What is the probability that a randomly selected subject in the study will be Asian-American and
have an Asian/Pacific-Islander physician?
(d) What is the probability that a subject chosen at random will be Hispanic or have a Hispanic
physician?
(e) Use the concept of complementary events to find the probability that a subject chosen at random
in the study does not have a white physician.
3.4.5 If the probability of left-handedness in a certain group of people is .05, what is the probability of
right-handedness (assuming no ambidexterity)?
3.4.6 The probability is .6 that a patient selected at random from the current residents of a certain hospital
will be a male. The probability that the patient will be a male who is in for surgery is .2. A patient
randomly selected from current residents is found to be a male; what is the probability that the patient
is in the hospital for surgery?
3.4.7 In a certain population of hospital patients the probability is .35 that a randomly selected patient will
have heart disease. The probability is .86 that a patient with heart disease is a smoker. What is the
probability that a patient randomly selected from the population will be a smoker and have heart disease?
3.5 BAYES’ THEOREM, SCREENING TESTS,
SENSITIVITY, SPECIFICITY, AND PREDICTIVE
VALUE POSITIVE AND NEGATIVE
In the health sciences field a widely used application of probability laws and concepts is
found in the evaluation of screening tests and diagnostic criteria. Of interest to clinicians is
an enhanced ability to correctly predict the presence or absence of a particular disease from
knowledge of test results (positive or negative) and/or the status of presenting symptoms
(present or absent). Also of interest is information regarding the likelihood of positive and
negative test results and the likelihood of the presence or absence of a particular symptom
in patients with and without a particular disease.
In our consideration of screening tests, we must be aware of the fact that they are not
infallible. That is, a testing procedure may yield a false positive or a false negative.
DEFINITION
1. A false positive results when a test indicates a positive status when
the true status is negative.
2. A false negative results when a test indicates a negative status when
the true status is positive.
In summary, the following questions must be answered in order to evaluate the
usefulness of test results and symptom status in determining whether or not a subject has
some disease:
1. Given that a subject has the disease, what is the probability of a positive test result (or
the presence of a symptom)?
2. Given that a subject does not have the disease, what is the probability of a negative
test result (or the absence of a symptom)?
3. Given a positive screening test (or the presence of a symptom), what is the probability
that the subject has the disease?
4. Given a negative screening test result (or the absence of a symptom), what is the
probability that the subject does not have the disease?
Suppose we have for a sample of n subjects (where n is a large number) the
information shown in Table 3.5.1. The table shows for these n subjects their status with
regard to a disease and results from a screening test designed to identify subjects with the
disease. The cell entries represent the number of subjects falling into the categories defined
by the row and column headings. For example, a is the number of subjects who have the
disease and whose screening test result was positive.
TABLE 3.5.1 Sample of n Subjects (Where n Is Large) Cross-Classified According to Disease Status and Screening Test Result

| Test Result | Disease Present ($D$) | Disease Absent ($\bar{D}$) | Total |
|---|---|---|---|
| Positive ($T$) | $a$ | $b$ | $a + b$ |
| Negative ($\bar{T}$) | $c$ | $d$ | $c + d$ |
| Total | $a + c$ | $b + d$ | $n$ |

As we have learned, a variety of probability estimates may be computed from the
information displayed in a two-way table such as Table 3.5.1. For example, we may
compute the conditional probability estimate $P(T \mid D) = a/(a + c)$. This ratio is an
estimate of the sensitivity of the screening test.
DEFINITION
The sensitivity of a test (or symptom) is the probability of a positive test
result (or presence of the symptom) given the presence of the disease.
We may also compute the conditional probability estimate $P(\bar{T} \mid \bar{D}) = d/(b + d)$.
This ratio is an estimate of the specificity of the screening test.
DEFINITION
The specificity of a test (or symptom) is the probability of a negative test
result (or absence of the symptom) given the absence of the disease.
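The two ratios just defined translate directly into code. The following is a minimal sketch in which the function names are our own and the cell labels a, b, c, d follow the layout of Table 3.5.1. The counts plugged in are those reported in Example 3.5.1 later in this section.

```python
def sensitivity(a, c):
    """Estimate P(T | D) = a / (a + c): positive results among subjects with the disease."""
    return a / (a + c)

def specificity(b, d):
    """Estimate P(not T | not D) = d / (b + d): negative results among subjects without it."""
    return d / (b + d)

# Cell counts in the layout of Table 3.5.1, taken from Example 3.5.1:
# a = diseased and positive, b = disease-free and positive,
# c = diseased and negative, d = disease-free and negative.
a, b, c, d = 436, 5, 14, 495

sens = sensitivity(a, c)
spec = specificity(b, d)

print(round(sens, 4), round(spec, 4))  # 0.9689 0.99
```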
From the data in Table 3.5.1 we answer Question 3 by computing the conditional
probability estimate $P(D \mid T)$. This ratio is an estimate of a probability called the predictive
value positive of a screening test (or symptom).
DEFINITION
The predictive value positive of a screening test (or symptom) is the
probability that a subject has the disease given that the subject has a
positive screening test result (or has the symptom).
Similarly, the ratio $P(\bar{D} \mid \bar{T})$ is an estimate of the conditional probability that a subject
does not have the disease given that the subject has a negative screening test result (or does
not have the symptom). The probability estimated by this ratio is called the predictive value
negative of the screening test or symptom.
DEFINITION
The predictive value negative of a screening test (or symptom) is the
probability that a subject does not have the disease, given that the subject
has a negative screening test result (or does not have the symptom).
Estimates of the predictive value positive and predictive value negative of a test (or
symptom) may be obtained from knowledge of a test’s (or symptom’s) sensitivity and
specificity and the probability of the relevant disease in the general population. To obtain
these predictive value estimates, we make use of Bayes’s theorem. The following statement
of Bayes’s theorem, employing the notation established in Table 3.5.1, gives the predictive
value positive of a screening test (or symptom):
$$P(D \mid T) = \frac{P(T \mid D)P(D)}{P(T \mid D)P(D) + P(T \mid \bar{D})P(\bar{D})} \tag{3.5.1}$$
It is instructive to examine the composition of Equation 3.5.1. We recall from
Equation 3.4.2 that the conditional probability $P(D \mid T)$ is equal to $P(D \cap T)/P(T)$. To
understand the logic of Bayes’s theorem, we must recognize that the numerator of Equation
3.5.1 represents $P(D \cap T)$ and that the denominator represents $P(T)$. We know from the
multiplication rule of probability given in Equation 3.4.1 that the numerator of Equation
3.5.1, $P(T \mid D)P(D)$, is equal to $P(D \cap T)$.
Now let us show that the denominator of Equation 3.5.1 is equal to $P(T)$. We know
that event T is the result of a subject’s being classified as positive with respect to a
screening test (or classified as having the symptom). A subject classified as positive may
have the disease or may not have the disease. Therefore, the occurrence of T is the result
of a subject having the disease and being positive $[P(D \cap T)]$ or not having the disease
and being positive $[P(\bar{D} \cap T)]$. These two events are mutually exclusive (their intersection
is zero), and consequently, by the addition rule given by Equation 3.4.3, we
may write
$$P(T) = P(D \cap T) + P(\bar{D} \cap T) \tag{3.5.2}$$
Since, by the multiplication rule, $P(D \cap T) = P(T \mid D)P(D)$ and
$P(\bar{D} \cap T) = P(T \mid \bar{D})P(\bar{D})$, we may rewrite Equation 3.5.2 as
$$P(T) = P(T \mid D)P(D) + P(T \mid \bar{D})P(\bar{D}) \tag{3.5.3}$$
which is the denominator of Equation 3.5.1.
Note, also, that the numerator of Equation 3.5.1 is equal to the sensitivity times the
rate (prevalence) of the disease and the denominator is equal to the sensitivity times the rate
of the disease plus the term 1 minus the specificity times the term 1 minus the rate of the
disease, since $P(T \mid \bar{D}) = 1 - P(\bar{T} \mid \bar{D})$. Thus, we see that the predictive value positive can be calculated from knowledge
of the sensitivity, specificity, and the rate of the disease.
Evaluation of Equation 3.5.1 answers Question 3. To answer Question 4 we
follow a now familiar line of reasoning to arrive at the following statement of Bayes’s
theorem:
$$P(\bar{D} \mid \bar{T}) = \frac{P(\bar{T} \mid \bar{D})P(\bar{D})}{P(\bar{T} \mid \bar{D})P(\bar{D}) + P(\bar{T} \mid D)P(D)} \tag{3.5.4}$$
Equation 3.5.4 allows us to compute an estimate of the probability that a subject who is
negative on the test (or has no symptom) does not have the disease, which is the predictive
value negative of a screening test or symptom.
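Equations 3.5.1 and 3.5.4 can be written as two short functions of the sensitivity, the specificity, and the rate of the disease. The sketch below is one possible rendering, not a definitive implementation: the function and argument names are our own, and the inputs in the usage lines are hypothetical values chosen only to make the arithmetic easy to follow.

```python
def predictive_value_positive(sens, spec, prevalence):
    """Equation 3.5.1: P(D | T) from sensitivity, specificity, and P(D)."""
    true_pos = sens * prevalence                # P(T | D) P(D) = P(D and T)
    false_pos = (1 - spec) * (1 - prevalence)   # P(T | not D) P(not D)
    return true_pos / (true_pos + false_pos)    # denominator is P(T), Eq. 3.5.3

def predictive_value_negative(sens, spec, prevalence):
    """Equation 3.5.4: P(not D | not T)."""
    true_neg = spec * (1 - prevalence)          # P(not T | not D) P(not D)
    false_neg = (1 - sens) * prevalence         # P(not T | D) P(D)
    return true_neg / (true_neg + false_neg)

# Hypothetical inputs (not from the text): a test with sensitivity and
# specificity of .9 applied where half the population has the disease.
ppv = predictive_value_positive(sens=0.9, spec=0.9, prevalence=0.5)
npv = predictive_value_negative(sens=0.9, spec=0.9, prevalence=0.5)

print(round(ppv, 4), round(npv, 4))  # 0.9 0.9
```

Passing the specificity rather than $P(T \mid \bar{D})$ is a design choice made here so that both functions take the same three quantities the text names.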
We illustrate the use of Bayes’ theorem for calculating a predictive value positive
with the following example.
EXAMPLE 3.5.1
A medical research team wished to evaluate a proposed screening test for Alzheimer’s
disease. The test was given to a random sample of 450 patients with Alzheimer’s disease
and an independent random sample of 500 patients without symptoms of the disease.
The two samples were drawn from populations of subjects who were 65 years of age or
older. The results are as follows:
| Test Result | Alzheimer’s Diagnosis: Yes ($D$) | Alzheimer’s Diagnosis: No ($\bar{D}$) | Total |
|---|---|---|---|
| Positive ($T$) | 436 | 5 | 441 |
| Negative ($\bar{T}$) | 14 | 495 | 509 |
| Total | 450 | 500 | 950 |
Using these data we estimate the sensitivity of the test to be $P(T \mid D) = 436/450 = .97$. The
specificity of the test is estimated to be $P(\bar{T} \mid \bar{D}) = 495/500 = .99$. We now use the results of
the study to compute the predictive value positive of the test. That is, we wish to estimate the
probability that a subject who is positive on the test has Alzheimer’s disease. From the
tabulated data we compute $P(T \mid D) = 436/450 = .9689$ and $P(T \mid \bar{D}) = 5/500 = .01$.
Substitution of these results into Equation 3.5.1 gives
$$P(D \mid T) = \frac{(.9689)P(D)}{(.9689)P(D) + (.01)P(\bar{D})} \tag{3.5.5}$$
We see that the predictive value positive of the test depends on the rate of the disease in the
relevant population in general. In this case the relevant population consists of subjects who
are 65 years of age or older. We emphasize that the rate of disease in the relevant general
population, $P(D)$, cannot be computed from the sample data, since two independent samples
were drawn from two different populations. We must look elsewhere for an estimate of $P(D)$.
Evans et al. (A-5) estimated that 11.3 percent of the U.S. population aged 65 and over have
Alzheimer’s disease. When we substitute this estimate of $P(D)$ into Equation 3.5.5 we
obtain
$$P(D \mid T) = \frac{(.9689)(.113)}{(.9689)(.113) + (.01)(1 - .113)} = .93$$
As we see, in this case, the predictive value positive of the test is very high.
Similarly, let us now consider the predictive value negative of the test. We have
already calculated all entries necessary except for $P(\bar{T} \mid D) = 14/450 = .0311$. Using the
values previously obtained and our new value, we find
$$P(\bar{D} \mid \bar{T}) = \frac{(.99)(1 - .113)}{(.99)(1 - .113) + (.0311)(.113)} = .996$$
As we see, the predictive value negative is also quite high.
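The calculations of Example 3.5.1 can be retraced as follows. This is a sketch under stated assumptions: the counts and the 11.3 percent prevalence are those given in the example, while the function name `predictive_values` and the extra prevalence values in the loop are hypothetical additions of ours, included only to show how strongly the predictive value positive depends on the rate of the disease.

```python
# Estimates from the table in Example 3.5.1.
sens = 436 / 450   # P(T | D)
spec = 495 / 500   # P(not T | not D)

def predictive_values(prevalence):
    """Return (predictive value positive, predictive value negative)."""
    p_d, p_not_d = prevalence, 1 - prevalence
    pvp = sens * p_d / (sens * p_d + (1 - spec) * p_not_d)       # Eq. 3.5.1
    pvn = spec * p_not_d / (spec * p_not_d + (1 - sens) * p_d)   # Eq. 3.5.4
    return pvp, pvn

# Prevalence estimate cited in the example (11.3 percent).
pvp, pvn = predictive_values(0.113)
print(round(pvp, 2), round(pvn, 3))  # 0.93 0.996

# Hypothetical prevalences: the same test gives very different
# predictive values positive as the disease becomes rarer or more common.
for prev in (0.001, 0.01, 0.113, 0.5):
    print(prev, round(predictive_values(prev)[0], 3))
```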
EXERCISES
3.5.1 A medical research team wishes to assess the usefulness of a certain symptom (call it S) in the
diagnosis of a particular disease. In a random sample of 775 patients with the disease, 744 reported
having the symptom. In an independent random sample of 1380 subjects without the disease, 21
reported that they had the symptom.
(a) In the context of this exercise, what is a false positive?
(b) What is a false negative?
(c) Compute the sensitivity of the symptom.
(d) Compute the specificity of the symptom.
(e) Suppose it is known that the rate of the disease in the general population is .001. What is the
predictive value positive of the symptom?
(f) What is the predictive value negative of the symptom?
(g) Find the predictive value positive and the predictive value negative for the symptom for the
following hypothetical disease rates: .0001, .01, and .10.
(h) What do you conclude about the predictive value of the symptom on the basis of the results
obtained in part g?
3.5.2 In an article entitled “Bucket-Handle Meniscal Tears of the Knee: Sensitivity and Specificity of MRI
signs,” Dorsay and Helms (A-6) performed a retrospective study of 71 knees scanned by MRI. One of
the indicators they examined was the absence of the “bow-tie sign” in the MRI as evidence of a
bucket-handle or “bucket-handle type” tear of the meniscus. In the study, surgery confirmed that 43 of
the 71 cases were bucket-handle tears. The cases may be cross-classified by “bow-tie sign” status and
surgical results as follows:
| | Tear Surgically Confirmed ($D$) | Tear Surgically Confirmed As Not Present ($\bar{D}$) | Total |
|---|---|---|---|
| Positive Test (absent bow-tie sign) ($T$) | 38 | 10 | 48 |
| Negative Test (bow-tie sign present) ($\bar{T}$) | 5 | 18 | 23 |
| Total | 43 | 28 | 71 |
Source: Theodore A. Dorsay and Clyde A. Helms, “Bucket-handle Meniscal Tears of the Knee: Sensitivity
and Specificity of MRI Signs,” Skeletal Radiology, 32 (2003), 266–272.
(a) What is the sensitivity of testing to see if the absent bow tie sign indicates a meniscal tear?
(b) What is the specificity of testing to see if the absent bow tie sign indicates a meniscal tear?
(c) What additional information would you need to determine the predictive value of the test?
3.5.3 Oexle et al. (A-7) calculated the negative predictive value of a test for carriers of X-linked ornithine
transcarbamylase deficiency (OTCD—a disorder of the urea cycle). A test known as the “allopurinol
test” is often used as a screening device of potential carriers whose relatives are OTCD patients. They
cited a study by Brusilow and Horwich (A-8) that estimated the sensitivity of the allopurinol test as
.927. Oexle et al. themselves estimated the specificity of the allopurinol test as .997. Also they
estimated the prevalence in the population of individuals with OTCD as 1/32000. Use this
information and Bayes’s theorem to calculate the predictive value negative of the allopurinol
screening test.
3.6 SUMMARY
In this chapter some of the basic ideas and concepts of probability were presented. The
objective has been to provide enough of a “feel” for the subject so that the probabilistic
aspects of statistical inference can be more readily understood and appreciated when this
topic is presented later.
We defined probability as a number between 0 and 1 that measures the likelihood of
the occurrence of some event. We distinguished between subjective probability and
objective probability. Objective probability can be categorized further as classical or
relative frequency probability. After stating the three properties of probability, we defined
and illustrated the calculation of the following kinds of probabilities: marginal, joint, and
conditional. We also learned how to apply the addition and multiplication rules to find
certain probabilities. We learned the meaning of independent, mutually exclusive, and
complementary events. We learned the meaning of specificity, sensitivity, predictive value
positive, and predictive value negative as applied to a screening test or disease symptom.
Finally, we learned how to use Bayes’s theorem to calculate the probability that a subject
has a disease, given that the subject has a positive screening test result (or has the symptom
of interest).
SUMMARY OF FORMULAS FOR CHAPTER 3

| Formula number | Name | Formula |
|---|---|---|
| 3.2.1 | Classical probability | $P(E) = \dfrac{m}{N}$ |
| 3.2.2 | Relative frequency probability | $P(E) = \dfrac{m}{n}$ |
| 3.3.1–3.3.3 | Properties of probability | $P(E_i) \geq 0$; $P(E_1) + P(E_2) + \cdots + P(E_n) = 1$; $P(E_i + E_j) = P(E_i) + P(E_j)$ |
| 3.4.1 | Multiplication rule | $P(A \cap B) = P(B)P(A \mid B) = P(A)P(B \mid A)$ |
| 3.4.2 | Conditional probability | $P(A \mid B) = \dfrac{P(A \cap B)}{P(B)}$ |
| 3.4.3 | Addition rule | $P(A \cup B) = P(A) + P(B) - P(A \cap B)$ |
| 3.4.4 | Independent events | $P(A \cap B) = P(A)P(B)$ |
| 3.4.5 | Complementary events | $P(\bar{A}) = 1 - P(A)$ |
| 3.4.6 | Marginal probability | $P(A_i) = \sum P(A_i \cap B_j)$ |
| | Sensitivity of a screening test | $P(T \mid D) = \dfrac{a}{(a + c)}$ |
| | Specificity of a screening test | $P(\bar{T} \mid \bar{D}) = \dfrac{d}{(b + d)}$ |
| 3.5.1 | Predictive value positive of a screening test | $P(D \mid T) = \dfrac{P(T \mid D)P(D)}{P(T \mid D)P(D) + P(T \mid \bar{D})P(\bar{D})}$ |
| 3.5.4 | Predictive value negative of a screening test | $P(\bar{D} \mid \bar{T}) = \dfrac{P(\bar{T} \mid \bar{D})P(\bar{D})}{P(\bar{T} \mid \bar{D})P(\bar{D}) + P(\bar{T} \mid D)P(D)}$ |
Symbol Key
• $D$ = disease
• $E$ = event
• $m$ = the number of times an event $E_i$ occurs
• $n$ = sample size or the total number of times a process occurs
• $N$ = population size or the total number of mutually exclusive and equally likely events
• $P(\bar{A})$ = a complementary event; the probability of an event A not occurring
• $P(E_i)$ = probability of some event $E_i$ occurring
• $P(A \cap B)$ = an “intersection” or “and” statement; the probability of an event A and an event B occurring
• $P(A \cup B)$ = a “union” or “or” statement; the probability of an event A or an event B or both occurring
• $P(A \mid B)$ = a conditional statement; the probability of an event A occurring given that an event B has already occurred
• $T$ = test results
REVIEW QUESTIONS AND EXERCISES
1. Define the following:
(a) Probability (b) Objective probability
(c) Subjective probability (d) Classical probability
(e) The relative frequency concept of probability (f) Mutually exclusive events
(g) Independence (h) Marginal probability
(i) Joint probability (j) Conditional probability
(k) The addition rule (l) The multiplication rule
(m) Complementary events (n) False positive
(o) False negative (p) Sensitivity
(q) Specificity (r) Predictive value positive
(s) Predictive value negative (t) Bayes’s theorem
2. Name and explain the three properties of probability.
3. Coughlin et al. (A-9) examined the breast and cervical screening practices of Hispanic and non-
Hispanic women in counties that approximate the U.S. southern border region. The study used data
from the Behavioral Risk Factor Surveillance System surveys of adults age 18 years or older
conducted in 1999 and 2000. The table below reports the number of observations of Hispanic and
non-Hispanic women who had received a mammogram in the past 2 years cross-classified with
marital status.
Marital Status                                   Hispanic   Non-Hispanic   Total
Currently Married                                     319            738    1057
Divorced or Separated                                 130            329     459
Widowed                                                88            402     490
Never Married or Living As an Unmarried Couple         41             95     136
Total                                                 578           1564    2142
Source: Steven S. Coughlin, Robert J. Uhler, Thomas Richards, and Katherine
M. Wilson, “Breast and Cervical Cancer Screening Practices Among Hispanic
and Non-Hispanic Women Residing Near the United States–Mexico Border,
1999–2000,” Family and Community Health, 26 (2003), 130–139.
(a) We select at random a subject who had a mammogram. What is the probability that she is
divorced or separated?
(b) We select at random a subject who had a mammogram and learn that she is Hispanic. With that
information, what is the probability that she is married?
(c) We select at random a subject who had a mammogram. What is the probability that she is non-
Hispanic and divorced or separated?
(d) We select at random a subject who had a mammogram. What is the probability that she is
Hispanic or she is widowed?
(e) We select at random a subject who had a mammogram. What is the probability that she is not
married?
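As a rough sketch of how marginal, joint, conditional, and union probabilities are read from a cross-classification such as the one above, the following Python fragment stores the table counts and defines small helper functions. The function names are hypothetical, and the usage lines deliberately compute quantities other than those asked for in parts (a) through (e).

```python
# Counts from the cross-classification table in Exercise 3.
counts = {
    "Currently Married": {"Hispanic": 319, "Non-Hispanic": 738},
    "Divorced or Separated": {"Hispanic": 130, "Non-Hispanic": 329},
    "Widowed": {"Hispanic": 88, "Non-Hispanic": 402},
    "Never Married or Living As an Unmarried Couple": {"Hispanic": 41, "Non-Hispanic": 95},
}

# Grand total of subjects in the table.
total = sum(sum(row.values()) for row in counts.values())


def marginal_row(status):
    """P(status): row total divided by the grand total."""
    return sum(counts[status].values()) / total


def marginal_col(group):
    """P(group): column total divided by the grand total."""
    return sum(row[group] for row in counts.values()) / total


def joint(status, group):
    """P(status and group): cell count divided by the grand total."""
    return counts[status][group] / total


def conditional_row_given_col(status, group):
    """P(status | group) = P(status and group) / P(group)."""
    return joint(status, group) / marginal_col(group)


def union(status, group):
    """P(status or group), by the addition rule."""
    return marginal_row(status) + marginal_col(group) - joint(status, group)


print(f"P(Hispanic) = {marginal_col('Hispanic'):.4f}")
print(f"P(Widowed | Non-Hispanic) = {conditional_row_given_col('Widowed', 'Non-Hispanic'):.4f}")
```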
4. Swor et al. (A-10) looked at the effectiveness of cardiopulmonary resuscitation (CPR) training in
people over 55 years old. They compared the skill retention rates of subjects in this age group who
completed a course in traditional CPR instruction with those who received chest-compression only
cardiopulmonary resuscitation (CC-CPR). Independent groups were tested 3 months after training.
The table below shows the skill retention numbers in regard to overall competence as assessed by
video ratings done by two video evaluators.
Rated Overall Competent   CPR   CC-CPR   Total
Yes                        12       15      27
No                         15       14      29
Total                      27       29      56
Source: Robert Swor, Scott Compton, Fern Vining, Lynn Ososky
Farr, Sue Kokko, Rebecca Pascual, and Raymond E. Jackson,
“A Randomized Controlled Trial of Chest Compression Only
CPR for Older Adults—a Pilot Study,” Resuscitation, 58 (2003),
177–185.
(a) Find the following probabilities and explain their meaning:
1. A randomly selected subject was enrolled in the CC-CPR class.
2. A randomly selected subject was rated competent.
3. A randomly selected subject was rated competent and was enrolled in the CPR course.
4. A randomly selected subject was rated competent or was enrolled in CC-CPR.
5. A randomly selected subject was rated competent given that they enrolled in the CC-CPR
course.
(b) We define the following events to be
A = a subject enrolled in the CPR course
B = a subject enrolled in the CC-CPR course
C = a subject was evaluated as competent
D = a subject was evaluated as not competent
Then explain why each of the following equations is or is not a true statement:
1. $P(A \cap C) = P(C \cap A)$    2. $P(A \cup B) = P(B \cup A)$
3. $P(A) = P(A \cup C) + P(A \cup D)$    4. $P(B \cup C) = P(B) + P(C)$
5. $P(D \mid A) = P(D)$    6. $P(C \cap B) = P(C)P(B)$
7. $P(A \cap B) = 0$    8. $P(C \cap B) = P(B)P(C \mid B)$
9. $P(A \cap D) = P(A)P(A \mid D)$
5. Pillmann et al. (A-11) studied patients with acute brief episodes of psychoses. The researchers
classified subjects into four personality types: obsessoid, asthenic/low self-confident, asthenic/high
self-confident, nervous/tense, and undeterminable. The table below cross-classifies these personality
types with three groups of subjects: those with acute and transient psychotic disorders (ATPD),
those with “positive” schizophrenia (PS), and those with bipolar schizo-affective disorder (BSAD):
Personality Type                    ATPD (1)   PS (2)   BSAD (3)   Total
Obsessoid (O)                              9        2          6      17
Asthenic/low Self-confident (A)           20       17         15      52
Asthenic/high Self-confident (S)           5        3          8      16
Nervous/tense (N)                          4        7          4      15
Undeterminable (U)                         4       13          9      26
Total                                     42       42         42     126
Source: Frank Pillmann, Raffaela Bloink, Sabine Balzuweit, Annette Haring, and
Andreas Marneros, “Personality and Social Interactions in Patients with Acute Brief
Psychoses,” Journal of Nervous and Mental Disease, 191 (2003), 503–508.
Find the following probabilities if a subject in this study is chosen at random:
(a) $P(O)$    (b) $P(A \cup 2)$    (c) $P(1)$    (d) $P(\bar{A})$
(e) $P(A \mid 3)$    (f) $P(\bar{3})$    (g) $P(2 \cap 3)$    (h) $P(2 \mid A)$
6. A certain county health department has received 25 applications for an opening that exists for a public
health nurse. Of these applicants 10 are over 30 and 15 are under 30. Seventeen hold bachelor’s
degrees only, and eight have master’s degrees. Of those under 30, six have master’s degrees. If a
selection from among these 25 applicants is made at random, what is the probability that a person
over 30 or a person with a master’s degree will be selected?
7. The following table shows 1000 nursing school applicants classified according to scores made on a
college entrance examination and the quality of the high school from which they graduated, as rated
by a group of educators:
Quality of High Schools
Score Poor (P) Average (A) Superior (S) Total
Low (L) 105 60 55 220
Medium (M) 70 175 145 390
High (H) 25 65 300 390
Total 200 300 500 1000
(a) Calculate the probability that an applicant picked at random from this group:
1. Made a low score on the examination.
2. Graduated from a superior high school.
3. Made a low score on the examination and graduated from a superior high school.
4. Made a low score on the examination given that he or she graduated from a superior high
school.
5. Made a high score or graduated from a superior high school.
(b) Calculate the following probabilities:
1. P(A) 2. P(H) 3. P(M)
4. $P(A \mid H)$    5. $P(M \cap P)$    6. $P(H \mid S)$
8. If the probability that a public health nurse will find a client at home is .7, what is the probability
(assuming independence) that on two home visits made in a day both clients will be home?
9. For a variety of reasons, self-reported disease outcomes are frequently used without verification in
epidemiologic research. In a study by Parikh-Patel et al. (A-12), researchers looked at the relationship
between self-reported cancer cases and actual cases. They used the self-reported cancer data from a
California Teachers Study and validated the cancer cases by using the California Cancer Registry
data. The following table reports their findings for breast cancer:
Cancer Reported (A)   Cancer in Registry (B)   Cancer Not in Registry    Total
Yes                                     2991                     2244     5235
No                                       112                   115849   115961
Total                                   3103                   118093   121196
Source: Arti Parikh-Patel, Mark Allen, William E. Wright, and the California Teachers Study Steering Committee,
“Validation of Self-reported Cancers in the California Teachers Study,” American Journal of Epidemiology,
157 (2003), 539–545.
(a) Let A be the event of reporting breast cancer in the California Teachers Study. Find the
probability of A in this study.
(b) Let B be the event of having breast cancer confirmed in the California Cancer Registry. Find the
probability of B in this study.
(c) Find $P(A \cap B)$
(d) Find $P(A \mid B)$
(e) Find $P(B \mid A)$
(f) Find the sensitivity of using self-reported breast cancer as a predictor of actual breast cancer in
the California registry.
(g) Find the specificity of using self-reported breast cancer as a predictor of actual breast cancer in
the California registry.
10. In a certain population the probability that a randomly selected subject will have been exposed to
a certain allergen and experience a reaction to the allergen is .60. The probability is .8 that a
subject exposed to the allergen will experience an allergic reaction. If a subject is selected at
random from this population, what is the probability that he or she will have been exposed to the
allergen?
11. Suppose that 3 percent of the people in a population of adults have attempted suicide. It is also known
that 20 percent of the population are living below the poverty level. If these two events are
independent, what is the probability that a person selected at random from the population will have
attempted suicide and be living below the poverty level?
12. In a certain population of women 4 percent have had breast cancer, 20 percent are smokers, and 3
percent are smokers and have had breast cancer. A woman is selected at random from the population.
What is the probability that she has had breast cancer or smokes or both?
13. The probability that a person selected at random from a population will exhibit the classic symptom
of a certain disease is .2, and the probability that a person selected at random has the disease is .23.
The probability that a person who has the symptom also has the disease is .18. A person selected at
random from the population does not have the symptom. What is the probability that the person has
the disease?
14. For a certain population we define the following events for mother’s age at time of giving birth: A =
under 20 years; B = 20–24 years; C = 25–29 years; D = 30–44 years. Are the events A, B, C, and D
pairwise mutually exclusive?
15. Refer to Exercise 14. State in words the event $E = (A \cup B)$.
16. Refer to Exercise 14. State in words the event $F = (B \cup C)$.
17. Refer to Exercise 14. Comment on the event $G = (A \cap B)$.
18. For a certain population we define the following events with respect to plasma lipoprotein levels
(mg/dl): A = (10–15); B = ($\geq$ 30); C = ($\leq$ 20). Are the events A and B mutually exclusive? A and
C? B and C? Explain your answer to each question.
19. Refer to Exercise 18. State in words the meaning of the following events:
(a) $A \cup B$    (b) $A \cap B$    (c) $A \cap C$    (d) $A \cup C$
20. Refer to Exercise 18. State in words the meaning of the following events:
(a) $\bar{A}$    (b) $\bar{B}$    (c) $\bar{C}$
21. Rothenberg et al. (A-13) investigated the effectiveness of using the Hologic Sahara Sonometer, a
portable device that measures bone mineral density (BMD) in the ankle, in predicting a fracture. They
used a Hologic estimated bone mineral density value of .57 as a cutoff. The results of the
investigation yielded the following data:
                          Confirmed Fracture
                          Present ($D$)   Not Present ($\bar{D}$)   Total
BMD = .57 ($T$)                     214                       670     884
BMD > .57 ($\bar{T}$)                73                       330     403
Total                               287                      1000    1287
Source: Data provided courtesy of Ralph J. Rothenberg, M.D., Joan
L. Boyd, Ph.D., and John P. Holcomb, Ph.D.
(a) Calculate the sensitivity of using a BMD value of .57 as a cutoff value for predicting fracture and
interpret your results.
(b) Calculate the specificity of using a BMD value of .57 as a cutoff value for predicting fracture and
interpret your results.
22. Verma et al. (A-14) examined the use of heparin-PF4 ELISA screening for heparin-induced
thrombocytopenia (HIT) in critically ill patients. Using C-serotonin release assay (SRA) as the
way of validating HIT, the authors found that in 31 patients tested negative by SRA, 22 also tested
negative by heparin-PF4 ELISA.
(a) Calculate the specificity of the heparin-PF4 ELISA testing for HIT.
(b) Using a “literature derived sensitivity” of 95 percent and a prior probability of HIT occurrence as
3.1 percent, find the positive predictive value.
(c) Using the same information as part (b), find the negative predictive value.
23. The sensitivity of a screening test is .95, and its specificity is .85. The rate of the disease for which the
test is used is .002. What is the predictive value positive of the test?
Exercises for Use with Large Data Sets Available on the Following Website:
www.wiley.com/college/daniel
Refer to the random sample of 800 subjects from the North Carolina birth registry we investigated in
the Chapter 2 review exercises.
1. Create a table that cross-tabulates the counts of mothers in the classifications of whether the baby
was premature or not (PREMIE) and whether the mother admitted to smoking during pregnancy
(SMOKE) or not.
(a) Find the probability that a mother in this sample admitted to smoking.
(b) Find the probability that a mother in this sample had a premature baby.
(c) Find the probability that a mother in the sample had a premature baby given that the mother
admitted to smoking.
(d) Find the probability that a mother in the sample had a premature baby given that the mother
did not admit to smoking.
(e) Find the probability that a mother in the sample had a premature baby or that the mother did
not admit to smoking.
2. Create a table that cross-tabulates the counts of each mother’s marital status (MARITAL) and
whether she had a low birth weight baby (LOW).
(a) Find the probability a mother selected at random in this sample had a low birth weight baby.
(b) Find the probability a mother selected at random in this sample was married.
(c) Find the probability a mother selected at random in this sample had a low birth weight child
given that she was married.
(d) Find the probability a mother selected at random in this sample had a low birth weight child
given that she was not married.
(e) Find the probability a mother selected at random in this sample had a low birth weight child
and the mother was married.
REFERENCES
Methodology References
1. ALLAN GUT, An Intermediate Course in Probability, Springer-Verlag, New York, 1995.
2. RICHARD ISAAC, The Pleasures of Probability, Springer-Verlag, New York, 1995.
3. HAROLD J. LARSON, Introduction to Probability, Addison-Wesley, Reading, MA, 1995.
4. L. J. SAVAGE, Foundations of Statistics, Second Revised Edition, Dover, New York, 1972.
5. A. N. KOLMOGOROV, Foundations of the Theory of Probability, Chelsea, New York, 1964 (Original German edition
published in 1933).
Applications References
A-1. TASHA D. CARTER, EMANUELA MUNDO, SAGAR V. PARKH, and JAMES L. KENNEDY, “Early Age at Onset as a Risk Factor
for Poor Outcome of Bipolar Disorder,” Journal of Psychiatric Research, 37 (2003), 297–303.
A-2. JOHN H. PORCERELLI, ROSEMARY COGAN, PATRICIA P. WEST, EDWARD A. ROSE, DAWN LAMBRECHT, KAREN E. WILSON,
RICHARD K. SEVERSON, and DUNIA KARANA, “Violent Victimization of Women and Men: Physical and Psychiatric
Symptoms,” Journal of the American Board of Family Practice, 16 (2003), 32–39.
A-3. DANIEL FERNANDO, ROBERT F. SCHILLING, JORGE FONTDEVILA, and NABILA EL-BASSEL, “Predictors of Sharing Drugs
among Injection Drug Users in the South Bronx: Implications for HIV Transmission,” Journal of Psychoactive
Drugs, 35 (2003), 227–236.
A-4. THOMAS A. LAVEIST and AMANI NURU-JETER, “Is Doctor-patient Race Concordance Associated with Greater
Satisfaction with Care?” Journal of Health and Social Behavior, 43 (2002), 296–306.
A-5. D. A. EVANS, P. A. SCHERR, N. R. COOK, M. S. ALBERT, H. H. FUNKENSTEIN, L. A. SMITH, L. E. HEBERT, T. T. WETLE,
L. G. BRANCH, M. CHOWN, C. H. HENNEKENS, and J. O. TAYLOR, “Estimated Prevalence of Alzheimer’s Disease in
the United States,” Milbank Quarterly, 68 (1990), 267–289.
A-6. THEODORE A. DORSAY and CLYDE A. HELMS, “Bucket-handle Meniscal Tears of the Knee: Sensitivity and
Specificity of MRI Signs,” Skeletal Radiology, 32 (2003), 266–272.
A-7. KONRAD OEXLE, LUISA BONAFE, and BEAT STENMANN, “Remark on Utility and Error Rates of the Allopurinol Test
in Detecting Mild Ornithine Transcarbamylase Deficiency,” Molecular Genetics and Metabolism, 76 (2002),
71–75.
A-8. S. W. BRUSILOW, A. L. HORWICH, “Urea Cycle Enzymes,” in: C. R. SCRIVER, A. L. BEAUDET, W. S. SLY, D. VALLE
(Eds.), The Metabolic and Molecular Bases of Inherited Disease, 8th ed., McGraw-Hill, New York, 2001,
pp. 1909–1963.
A-9. STEVEN S. COUGHLIN, ROBERT J. UHLER, THOMAS RICHARDS, and KATHERINE M. WILSON, “Breast and Cervical Cancer
Screening Practices Among Hispanic and Non-Hispanic Women Residing Near the United States-Mexico
Border, 1999–2000,” Family and Community Health, 26 (2003), 130–139.
A-10. ROBERT SWOR, SCOTT COMPTON, FERN VINING, LYNN OSOSKY FARR, SUE KOKKO, REBECCA PASCUAL, and RAYMOND E.
JACKSON, “A Randomized Controlled Trial of Chest Compression Only CPR for Older Adults—a Pilot Study,”
Resuscitation, 58 (2003), 177–185.
A-11. FRANK PILLMANN, RAFFAELA BLÖINK, SABINE BALZUWEIT, ANNETTE HARING, and ANDREAS MARNEROS, “Personality
and Social Interactions in Patients with Acute Brief Psychoses,” The Journal of Nervous and Mental Disease, 191
(2003), 503–508.
A-12. ARTI PARIKH-PATEL, MARK ALLEN, WILLIAM E. WRIGHT, and the California Teachers Study Steering Committee,
“Validation of Self-reported Cancers in the California Teachers Study,” American Journal of Epidemiology, 157
(2003), 539–545.
A-13. RALPH J. ROTHENBERG, JOAN L. BOYD, and JOHN P. HOLCOMB, “Quantitative Ultrasound of the Calcaneus as a
Screening Tool to Detect Osteoporosis: Different Reference Ranges for Caucasian Women, African-American
Women, and Caucasian Men,” Journal of Clinical Densitometry, 7 (2004), 101–110.
A-14. ARUN K. VERMA, MARC LEVINE, STEPHEN J. CARTER, and JOHN G. KELTON, “Frequency of Heparin-Induced
Thrombocytopenia in Critical Care Patients,” Pharmacotherapy, 23 (2003), 645–753.
CHAPTER 4
PROBABILITY DISTRIBUTIONS
CHAPTER OVERVIEW
Probability distributions of random variables assume powerful roles in statistical
analyses. Since they show all possible values of a random variable and the
probabilities associated with these values, probability distributions may be
summarized in ways that enable researchers to easily make objective decisions
based on samples drawn from the populations that the distributions
represent. This chapter introduces frequently used discrete and continuous
probability distributions that are used in later chapters to make statistical
inferences.
TOPICS
4.1 INTRODUCTION
4.2 PROBABILITY DISTRIBUTIONS OF DISCRETE VARIABLES
4.3 THE BINOMIAL DISTRIBUTION
4.4 THE POISSON DISTRIBUTION
4.5 CONTINUOUS PROBABILITY DISTRIBUTIONS
4.6 THE NORMAL DISTRIBUTION
4.7 NORMAL DISTRIBUTION APPLICATIONS
4.8 SUMMARY
LEARNING OUTCOMES
After studying this chapter, the student will
1. understand selected discrete distributions and how to use them to calculate
probabilities in real-world problems.
2. understand selected continuous distributions and how to use them to calculate
probabilities in real-world problems.
3. be able to explain the similarities and differences between distributions of the
discrete type and the continuous type and when the use of each is appropriate.
4.1 INTRODUCTION
In the preceding chapter we introduced the basic concepts of probability as well as methods
for calculating the probability of an event. We build on these concepts in the present chapter
and explore ways of calculating the probability of an event under somewhat more complex
conditions. In this chapter we shall see that the relationship between the values of a random
variable and the probabilities of their occurrence may be summarized by means of a device
called a probability distribution. A probability distribution may be expressed in the form of
a table, graph, or formula. Knowledge of the probability distribution of a random variable
provides the clinician and researcher with a powerful tool for summarizing and describing
a set of data and for reaching conclusions about a population of data on the basis of a
sample of data drawn from the population.
4.2 PROBABILITY DISTRIBUTIONS
OF DISCRETE VARIABLES
Let us begin our discussion of probability distributions by considering the probability
distribution of a discrete variable, which we shall define as follows:
DEFINITION
The probability distribution of a discrete random variable is a table,
graph, formula, or other device used to specify all possible values of a
discrete random variable along with their respective probabilities.
If we let the discrete probability distribution be represented by $p(x)$, then $p(x) =
P(X = x)$ is the probability that the discrete random variable $X$ assumes the value $x$.
EXAMPLE 4.2.1
In an article appearing in the Journal of the American Dietetic Association, Holben et al.
(A-1) looked at food security status in families in the Appalachian region of southern Ohio.
The purpose of the study was to examine hunger rates of families with children in a local
Head Start program in Athens, Ohio. The survey instrument included the 18-question U.S.
Household Food Security Survey Module for measuring hunger and food security. In
addition, participants were asked how many food assistance programs they had used in the
last 12 months. Table 4.2.1 shows the number of food assistance programs used by subjects
in this sample.
We wish to construct the probability distribution of the discrete variable X, where
X = number of food assistance programs used by the study subjects.
Solution: The values of $X$ are $x_1 = 1, x_2 = 2, \ldots, x_7 = 7$, and $x_8 = 8$. We compute the
probabilities for these values by dividing their respective frequencies by
the total, 297. Thus, for example, $p(x_1) = P(X = x_1) = 62/297 = .2088$.
We display the results in Table 4.2.2, which is the desired probability
distribution.
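The construction in Example 4.2.1 amounts to dividing each frequency by the total. A minimal Python sketch, using the frequencies of Table 4.2.1 (the variable names are illustrative only):

```python
# Frequencies from Table 4.2.1: number of programs -> number of families.
frequencies = {1: 62, 2: 47, 3: 39, 4: 39, 5: 58, 6: 37, 7: 4, 8: 11}

total = sum(frequencies.values())  # 297 families in all

# Relative frequency of each value x gives p(x) = P(X = x).
distribution = {x: f / total for x, f in frequencies.items()}

for x, p in distribution.items():
    print(f"{x}  {p:.4f}")
```

Rounded to four decimal places, the printed values reproduce the entries of Table 4.2.2.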
Alternatively, we can present this probability distribution in the form of a graph, as in
Figure 4.2.1. In Figure 4.2.1 the length of each vertical bar indicates the probability for the
corresponding value of x.
TABLE 4.2.1 Number of Assistance Programs Utilized by Families with
Children in Head Start Programs in Southern Ohio
Number of Programs   Frequency
1                           62
2                           47
3                           39
4                           39
5                           58
6                           37
7                            4
8                           11
Total                      297
Source: Data provided courtesy of David H. Holben,
Ph.D. and John P. Holcomb, Ph.D.
TABLE 4.2.2 Probability Distribution of Programs Utilized by Families
Among the Subjects Described in Example 4.2.1
Number of Programs (x)   $P(X = x)$
1                             .2088
2                             .1582
3                             .1313
4                             .1313
5                             .1953
6                             .1246
7                             .0135
8                             .0370
Total                        1.0000
It will be observed in Table 4.2.2 that the values of $p(x) = P(X = x)$ are all
positive, they are all less than 1, and their sum is equal to 1. These are not phenomena
peculiar to this particular example, but are characteristics of all probability distributions
of discrete variables. If $x_1, x_2, x_3, \ldots, x_k$ are all possible values of the discrete random
variable $X$, then we may give the following two essential properties of a probability
distribution of a discrete variable:
(1) $0 \leq P(X = x) \leq 1$
(2) $\sum P(X = x) = 1$, for all $x$
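The two properties can be checked mechanically for any tabulated distribution. The following is a small sketch; the function name `is_valid_distribution` and the tolerance are assumptions made here to absorb rounding in published tables, not part of the text.

```python
import math


def is_valid_distribution(p, tol=1e-4):
    """Check the two essential properties of a discrete probability distribution.

    p maps each possible value x to P(X = x). The tolerance is a hypothetical
    allowance for probabilities that were rounded before tabulation.
    """
    # Property (1): every probability lies between 0 and 1 inclusive.
    in_range = all(0.0 <= prob <= 1.0 for prob in p.values())
    # Property (2): the probabilities sum to 1 (within tolerance).
    sums_to_one = math.isclose(sum(p.values()), 1.0, abs_tol=tol)
    return in_range and sums_to_one


# Probabilities as printed in Table 4.2.2.
table_4_2_2 = {1: .2088, 2: .1582, 3: .1313, 4: .1313,
               5: .1953, 6: .1246, 7: .0135, 8: .0370}

print(is_valid_distribution(table_4_2_2))
```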
The reader will also note that each of the probabilities in Table 4.2.2 is the relative
frequency of occurrence of the corresponding value of X.
With its probability distribution available to us, we can make probability statements
regarding the random variable X. We illustrate with some examples.
EXAMPLE 4.2.2
What is the probability that a randomly selected family used three assistance programs?
Solution: We may write the desired probability as $p(3) = P(X = 3)$. We see in
Table 4.2.2 that the answer is .1313.
EXAMPLE 4.2.3
What is the probability that a randomly selected family used either one or two programs?
Solution: To answer this question, we use the addition rule for mutually exclusive
events. Using probability notation and the results in Table 4.2.2, we write the
answer as $P(1 \cup 2) = P(1) + P(2) = .2088 + .1582 = .3670$.
FIGURE 4.2.1 Graphical representation of the probability
distribution shown in Table 4.2.1. [Bar chart: probability (vertical axis, 0.00 to 0.25)
against x, the number of assistance programs (horizontal axis, 1 to 8).]
Cumulative Distributions Sometimes it will be more convenient to work with
the cumulative probability distribution of a random variable. The cumulative probability
distribution for the discrete variable whose probability distribution is given in Table 4.2.2
may be obtained by successively adding the probabilities, $P(X = x_i)$, given in the last
column. The cumulative probability for $x_i$ is written as $F(x_i) = P(X \leq x_i)$. It gives the
probability that $X$ is less than or equal to a specified value, $x_i$.
The resulting cumulative probability distribution is shown in Table 4.2.3. The graph
of the cumulative probability distribution is shown in Figure 4.2.2. The graph of a
cumulative probability distribution is called an ogive. In Figure 4.2.2 the graph of F(x)
consists solely of the horizontal lines. The vertical lines only give the graph a connected
appearance. The length of each vertical line represents the same probability as that of the
corresponding line in Figure 4.2.1. For example, the length of the vertical line at X = 3
in Figure 4.2.2 represents the same probability as the length of the line erected at X = 3 in
Figure 4.2.1, or .1313 on the vertical scale.
TABLE 4.2.3 Cumulative Probability Distribution of Number of Programs
Utilized by Families Among the Subjects Described in Example 4.2.1
Number of Programs (x)   Cumulative Frequency $P(X \leq x)$
1                         .2088
2                         .3670
3                         .4983
4                         .6296
5                         .8249
6                         .9495
7                         .9630
8                        1.0000
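The successive addition that produces Table 4.2.3 can be sketched with a running total. In this illustrative fragment the counts of Table 4.2.1 are accumulated first and divided by the total afterward, a choice made here so that rounding error does not build up across rows.

```python
from itertools import accumulate

# Frequencies from Table 4.2.1: number of programs -> number of families.
frequencies = {1: 62, 2: 47, 3: 39, 4: 39, 5: 58, 6: 37, 7: 4, 8: 11}
total = sum(frequencies.values())

xs = sorted(frequencies)
# Running totals of the counts: 62, 109, 148, ...
cumulative_counts = list(accumulate(frequencies[x] for x in xs))

# F(x) = P(X <= x) is the running total divided by the grand total.
cdf = {x: c / total for x, c in zip(xs, cumulative_counts)}

for x in xs:
    print(f"{x}  {cdf[x]:.4f}")
```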
FIGURE 4.2.2 Cumulative probability distribution of number of assistance programs
among the subjects described in Example 4.2.1. [Step graph: F(x) (vertical axis, 0.0 to 1.0)
against x, the number of programs (horizontal axis, 1 to 8).]
By consulting the cumulative probability distribution we may answer quickly
questions like those in the following examples.
EXAMPLE 4.2.4
What is the probability that a family picked at random used two or fewer assistance
programs?
Solution: The probability we seek may be found directly in Table 4.2.3 by reading the
cumulative probability opposite $x = 2$, and we see that it is .3670. That is,
$P(X \leq 2) = .3670$. We also may find the answer by inspecting Figure 4.2.2
and determining the height of the graph (as measured on the vertical axis)
above the value $X = 2$.
EXAMPLE 4.2.5
What is the probability that a randomly selected family used fewer than four programs?
Solution: Since a family that used fewer than four programs used either one, two, or
three programs, the answer is the cumulative probability for 3. That is,
$P(X < 4) = P(X \leq 3) = .4983$.
EXAMPLE 4.2.6
What is the probability that a randomly selected family used five or more programs?
Solution: To find the answer we make use of the concept of complementary probabilities.
The set of families that used five or more programs is the complement of
the set of families that used fewer than five (that is, four or fewer) programs.
The sum of the two probabilities associated with these sets is equal to 1. We
write this relationship in probability notation as $P(X \geq 5) + P(X \leq 4) = 1$.
Therefore, $P(X \geq 5) = 1 - P(X \leq 4) = 1 - .6296 = .3704$.
EXAMPLE 4.2.7
What is the probability that a randomly selected family used between three and five
programs, inclusive?
Solution: $P(X \leq 5) = .8249$ is the probability that a family used between one and five
programs, inclusive. To get the probability of between three and five
programs, we subtract, from .8249, the probability of two or fewer. Using
probability notation we write the answer as $P(3 \leq X \leq 5) = P(X \leq 5) -
P(X \leq 2) = .8249 - .3670 = .4579$.
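Examples 4.2.4 through 4.2.7 all reduce to lookups and differences in the cumulative distribution. A minimal sketch follows; the helper names are hypothetical, and the code assumes an integer-valued variable, so that $P(X < x) = F(x - 1)$.

```python
# Cumulative probabilities as printed in Table 4.2.3.
cdf = {1: .2088, 2: .3670, 3: .4983, 4: .6296,
       5: .8249, 6: .9495, 7: .9630, 8: 1.0000}


def F(x):
    """P(X <= x); zero below the smallest tabulated value."""
    eligible = [v for v in cdf if v <= x]
    return cdf[max(eligible)] if eligible else 0.0


def prob_less_than(x):
    """P(X < x) = F(x - 1) for an integer-valued variable."""
    return F(x - 1)


def prob_at_least(x):
    """P(X >= x) = 1 - P(X <= x - 1), by complementary probabilities."""
    return 1.0 - F(x - 1)


def prob_between(a, b):
    """P(a <= X <= b) = F(b) - F(a - 1), endpoints inclusive."""
    return F(b) - F(a - 1)


print(f"P(X <= 2)      = {F(2):.4f}")                # Example 4.2.4
print(f"P(X < 4)       = {prob_less_than(4):.4f}")   # Example 4.2.5
print(f"P(X >= 5)      = {prob_at_least(5):.4f}")    # Example 4.2.6
print(f"P(3 <= X <= 5) = {prob_between(3, 5):.4f}")  # Example 4.2.7
```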
The probability distribution given in Table 4.2.1 was developed out of actual experience, so
to find another variable following this distribution would be coincidental. The probability
distributions of many variables of interest, however, can be determined or assumed on the
basis of theoretical considerations. In later sections, we study in detail three of these
theoretical probability distributions: the binomial, the Poisson, and the normal.
Mean and Variance of Discrete Probability Distributions The
mean and variance of a discrete probability distribution can easily be found using the
formulae below.
$$\mu = \sum x\,p(x) \qquad (4.2.1)$$
$$\sigma^2 = \sum (x - \mu)^2 p(x) = \sum x^2 p(x) - \mu^2 \qquad (4.2.2)$$
where $p(x)$ is the relative frequency of a given random variable $X$. The standard deviation is
simply the positive square root of the variance.
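The second equality in (4.2.2) is stated without derivation. A short justification, added here as a supplementary step, expands the square and uses $\sum x\,p(x) = \mu$ and $\sum p(x) = 1$:

```latex
\begin{aligned}
\sigma^2 &= \sum (x - \mu)^2 p(x) \\
         &= \sum x^2 p(x) - 2\mu \sum x\,p(x) + \mu^2 \sum p(x) \\
         &= \sum x^2 p(x) - 2\mu^2 + \mu^2 \\
         &= \sum x^2 p(x) - \mu^2
\end{aligned}
```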
EXAMPLE 4.2.8
What are the mean, variance, and standard deviation of the distribution from Example 4.2.1?
Solution:
$$\mu = (1)(.2088) + (2)(.1582) + (3)(.1313) + \cdots + (8)(.0370) = 3.5589$$
$$\sigma^2 = (1 - 3.5589)^2(.2088) + (2 - 3.5589)^2(.1582) + (3 - 3.5589)^2(.1313) + \cdots + (8 - 3.5589)^2(.0370) = 3.8559$$
We therefore can conclude that the mean number of programs utilized was 3.5589 with a
variance of 3.8559. The standard deviation is therefore $\sqrt{3.8559} = 1.9637$ programs.
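The calculation in Example 4.2.8 can be sketched in Python as follows. This fragment works from the unrounded relative frequencies of Table 4.2.1 rather than the four-decimal probabilities of Table 4.2.2, so its results may differ from the text in the fourth decimal place. It also evaluates both forms of (4.2.2) as a check on one another.

```python
import math

# Frequencies from Table 4.2.1: number of programs -> number of families.
frequencies = {1: 62, 2: 47, 3: 39, 4: 39, 5: 58, 6: 37, 7: 4, 8: 11}
total = sum(frequencies.values())
p = {x: f / total for x, f in frequencies.items()}  # p(x) = P(X = x)

# Equation (4.2.1): the mean is the probability-weighted sum of the values.
mean = sum(x * px for x, px in p.items())

# Equation (4.2.2), definitional form: weighted squared deviations from the mean.
variance = sum((x - mean) ** 2 * px for x, px in p.items())

# Equation (4.2.2), computational form: sum of x^2 p(x) minus the squared mean.
variance_shortcut = sum(x ** 2 * px for x, px in p.items()) - mean ** 2

# The standard deviation is the positive square root of the variance.
sd = math.sqrt(variance)

print(f"mean = {mean:.4f}")
print(f"variance = {variance:.4f} (shortcut form: {variance_shortcut:.4f})")
print(f"standard deviation = {sd:.4f}")
```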
EXERCISES
4.2.1. In a study by Cross et al. (A-2), patients who were involved in problem gambling treatment were
asked about co-occurring drug and alcohol addictions. Let the discrete random variable X represent
the number of co-occurring addictive substances used by the subjects. Table 4.2.4 summarizes the
frequency distribution for this random variable.
(a) Construct a table of the relative frequency and the cumulative frequency for this discrete
distribution.
(b) Construct a graph of the probability distribution and a graph representing the cumulative
probability distribution for these data.
4.2.2. Refer to Exercise 4.2.1.
(a) What is the probability that an individual selected at random used five addictive substances?
(b) What is the probability that an individual selected at random used fewer than three addictive
substances?
(c) What is the probability that an individual selected at random used more than six addictive
substances?
(d) What is the probability that an individual selected at random used between two and five addictive
substances, inclusive?
4.2.3. Refer to Exercise 4.2.1. Find the mean, variance, and standard deviation of this frequency distribution.
4.3 THE BINOMIAL DISTRIBUTION
The binomial distribution is one of the most widely encountered probability distributions in
applied statistics. The distribution is derived from a process known as a Bernoulli trial,
named in honor of the Swiss mathematician James Bernoulli (1654–1705), who made
significant contributions in the field of probability, including, in particular, the binomial
distribution. When a random process or experiment, called a trial,
