TENTH EDITION

BIOSTATISTICS

A Foundation for Analysis

in the Health Sciences

WAYNE W. DANIEL, PH.D.

Professor Emeritus

Georgia State University

CHAD L. CROSS, PH.D., PSTAT®

Statistician

Office of Informatics and Analytics

Veterans Health Administration

Associate Graduate Faculty

University of Nevada, Las Vegas

This book was set in 10/12pt, Times Roman by Thomson Digital and printed and bound by Edwards Brothers Malloy.

The cover was printed by Edwards Brothers Malloy.

This book is printed on acid-free paper.

Founded in 1807, John Wiley & Sons, Inc. has been a valued source of knowledge and understanding for more

than 200 years, helping people around the world meet their needs and fulfill their aspirations. Our company is

built on a foundation of principles that include responsibility to the communities we serve and where we live and

work. In 2008, we launched a Corporate Citizenship Initiative, a global effort to address the environmental,

social, economic, and ethical challenges we face in our business. Among the issues we are addressing are carbon

impact, paper specifications and procurement, ethical conduct within our business and among our vendors, and

community and charitable support. For more information, please visit our website: www.wiley.com/go/citizenship.

Copyright © 2013, 2009, 2005, 1999 John Wiley & Sons, Inc. All rights reserved. No part of this publication

may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic,

mechanical, photocopying, recording, scanning or otherwise, except as permitted under Sections 107 or 108 of

the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or

authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222

Rosewood Drive, Danvers, MA 01923, website www.copyright.com. Requests to the Publisher for permission

should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ

07030-5774, (201)748-6011, fax (201)748-6008, website http://www.wiley.com/go/permissions.

Evaluation copies are provided to qualified academics and professionals for review purposes only, for use in their

courses during the next academic year. These copies are licensed and may not be sold or transferred to a third

party. Upon completion of the review period, please return the evaluation copy to Wiley. Return instructions and

a free of charge return mailing label are available at www.wiley.com/go/returnlabel. If you have chosen to adopt

this textbook for use in your course, please accept this book as your complimentary desk copy. Outside of the

United States, please contact your local sales representative.

Library of Congress Cataloging-in-Publication Data

Daniel, Wayne W., 1929-

Biostatistics : a foundation for analysis in the health sciences / Wayne W.

Daniel, Chad Lee Cross. — Tenth edition.

pages cm

Includes index.

ISBN 978-1-118-30279-8 (cloth)

1. Medical statistics. 2. Biometry. I. Cross, Chad Lee, 1971- II. Title.

RA409.D35 2013

610.72'7—dc23 2012038459

Printed in the United States of America

10 9 8 7 6 5 4 3 2 1

VP & EXECUTIVE PUBLISHER: Laurie Rosatone
ACQUISITIONS EDITOR: Shannon Corliss
PROJECT EDITOR: Ellen Keohane
MARKETING MANAGER: Melanie Kurkjian
MARKETING ASSISTANT: Patrick Flatley
PHOTO EDITOR: Sheena Goldstein
DESIGNER: Kenji Ngieng
PRODUCTION MANAGEMENT SERVICES: Thomson Digital
ASSOCIATE PRODUCTION MANAGER: Joyce Poh
PRODUCTION EDITOR: Jolene Ling
COVER PHOTO CREDIT: ©ktsimage/iStockphoto

Dr. Daniel

To my children, Jean, Carolyn,

and John, and to the memory of

their mother, my wife, Mary.

Dr. Cross

To my wife Pamela

and to my children, Annabella Grace

and Breanna Faith.

PREFACE

This 10th edition of Biostatistics: A Foundation for Analysis in the Health Sciences was

prepared with the objective of appealing to a wide audience. Previous editions of the book

have been used by the authors and their colleagues in a variety of contexts. For undergraduates,

this edition should provide an introduction to statistical concepts for students in

the biosciences and health sciences, and for mathematics majors desiring exposure to applied

statistical concepts. Like its predecessors, this edition is designed to meet the needs of

beginning graduate students in various fields such as nursing, applied sciences, and public

health who are seeking a strong foundation in quantitative methods. For professionals

already working in the health field, this edition can serve as a useful desk reference.

The breadth of coverage provided in this text, along with the hundreds of practical

exercises, allows instructors extensive flexibility in designing courses at many levels. To

that end, we offer below some ideas on topical coverage that we have found to be useful in

the classroom setting.

Like the previous editions of this book, this edition requires few mathematical prerequisites

beyond a solid proficiency in college algebra. We have maintained an emphasis

on practical and intuitive understanding of principles rather than on abstract concepts that

underlie some methods, and that require greater mathematical sophistication. With that in

mind, we have maintained a reliance on problem sets and examples taken directly from the

health sciences literature instead of contrived examples. We believe that this makes the text

more interesting for students, and more practical for practicing health professionals who

reference the text while performing their work duties.

For most of the examples and statistical techniques covered in this edition, we

discuss the use of computer software for calculations. Experience has informed our

decision to include example printouts from a variety of statistical software in this edition

(e.g., MINITAB, SAS, SPSS, and R). We feel that the inclusion of examples from these

particular packages, which are generally the most commonly utilized by practitioners,

provides a rich presentation of the material and allows the student the opportunity to

appreciate the various technologies used by practicing statisticians.

CHANGES AND UPDATES TO THIS EDITION

The majority of the chapters include corrections and clarifications that enhance the material

that is presented and make it more readable and accessible to the audience. We did,

however, make several specific changes and improvements that we believe are valuable

contributions to this edition, and we thank the reviewers of the previous edition for their

comments and suggestions in that regard.

Specific changes to this edition include additional text concerning measures of

dispersion in Chapter 2, additional text and examples using program R in Chapter 6, a new

introduction to linear models in Chapter 8 that ties together the regression and ANOVA

concepts in Chapters 8–11, the addition of two-factor repeated measures ANOVA in

Chapter 8, a discussion of the similarities of ANOVA and regression in Chapter 11,

and extensive new text and examples on testing the fit of logistic regression models in

Chapter 11.

Most important to this new edition is a new Chapter 14 on Survival Analysis. This

new chapter was born out of requests from reviewers of the text and from the experience

of the authors in terms of the growing use of these methods in applied research. In this

new chapter, we included some of the material found in Chapter 12 in previous editions,

and added extensive material and examples. We provide introductory coverage of

censoring, Kaplan–Meier estimates, methods for comparing survival curves, and the

Cox Regression Proportional Hazards model. Owing to this new material, we elected

to move the contents of the vital statistics chapter to a new Chapter 15 and make it

available online (www.wiley.com/college/daniel).

COURSE COVERAGE IDEAS

In the table below we provide some suggestions for topical coverage in a variety of

contexts, with “X” indicating those chapters we believe are most relevant for a variety of

courses for which this text is appropriate. The text has been designed to be flexible in order

to accommodate various teaching styles and various course presentations. Although the

text is designed with progressive presentation of concepts in mind, certain of the topics may

be skipped or covered briefly so that focus can be placed on concepts important to

instructors.

Course                                              Chapters
                                                    1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
Undergraduate course for health sciences students   X  X  X  X  X  X  X  X  X  O  O  X  O  O  O
Undergraduate course in applied statistics for      X  O  O  O  X  X  X  X  X  X  O  X  X  X  O
  mathematics majors
First biostatistics course for beginning graduate   X  X  X  X  X  X  X  X  X  X  O  X  X  X  O
  students
Biostatistics course for graduate health sciences   X  O  O  O  O  X  X  X  X  X  X  X  X  X  X
  students who have completed an introductory
  statistics course

X: Suggested coverage; O: Optional coverage.

SUPPLEMENTS

Instructor’s Solutions Manual. Prepared by Dr. Chad Cross, this manual includes

solutions to all problems found in the text. This manual is available only to instructors

who have adopted the text.

Student Solutions Manual. Prepared by Dr. Chad Cross, this manual includes solutions

to all odd-numbered exercises. This manual may be packaged with the text at a discounted

price.

Data Sets. More than 250 data sets are available online to accompany the text. These data

sets include those data presented in examples, exercises, review exercises, and the large

data sets found in some chapters. These are available in SAS, SPSS, and Minitab formats

as well as CSV format for importing into other programs. Data are available for downloading at

www.wiley.com/college/daniel

Those without Internet access may contact Wiley directly at 111 River Street, Hoboken, NJ

07030-5774; telephone: 1-877-762-2974.

ACKNOWLEDGMENTS

Many reviewers, students, and faculty have made contributions to this text through their

careful review, inquisitive questions, and professional discussion of topics. In particular,

we would like to thank Dr. Sheniz Moonie of the University of Nevada, Las Vegas; Dr. Roy

T. Sabo of Virginia Commonwealth University; and Dr. Derek Webb of Bemidji State

University for their useful comments on the ninth edition of this text.

There are three additional important acknowledgments that must be made to

important contributors to the text. Dr. John P. Holcomb of Cleveland State University

updated many of the examples and exercises found in the text. Dr. Edward Danial of

Morgan State University provided an extensive accuracy review of the ninth edition of the

text, and his valuable comments added greatly to the book. Dr. Jodi B. A. McKibben of the

Uniformed Services University of the Health Sciences provided an extensive accuracy

review of the current edition of the book.

We wish to acknowledge the cooperation of Minitab, Inc. for making available

to the authors over many years and editions of the book the latest versions of their

software.

Thanks are due to Professors Geoffrey Churchill and Brian Schott of Georgia State

University who wrote computer programs for generating some of the Appendix tables,

and to Professor Lillian Lin, who read and commented on the logistic regression material

in earlier editions of the book. Additionally, Dr. James T. Wassell provided useful

assistance with some of the survival analysis methods presented in earlier editions of

the text.

We are grateful to the many researchers in the health sciences field who publish their

results and hence make available data that provide valuable practice to the students of

biostatistics.

Wayne W. Daniel

Chad L. Cross*

*The views presented in this book are those of the author and do not necessarily represent the views of the U.S.

Department of Veterans Affairs.

BRIEF CONTENTS

1 INTRODUCTION TO BIOSTATISTICS
2 DESCRIPTIVE STATISTICS
3 SOME BASIC PROBABILITY CONCEPTS
4 PROBABILITY DISTRIBUTIONS
5 SOME IMPORTANT SAMPLING DISTRIBUTIONS
6 ESTIMATION
7 HYPOTHESIS TESTING
8 ANALYSIS OF VARIANCE
9 SIMPLE LINEAR REGRESSION AND CORRELATION
10 MULTIPLE REGRESSION AND CORRELATION
11 REGRESSION ANALYSIS: SOME ADDITIONAL TECHNIQUES
12 THE CHI-SQUARE DISTRIBUTION AND THE ANALYSIS OF FREQUENCIES
13 NONPARAMETRIC AND DISTRIBUTION-FREE STATISTICS
14 SURVIVAL ANALYSIS
15 VITAL STATISTICS (ONLINE)
APPENDIX: STATISTICAL TABLES
ANSWERS TO ODD-NUMBERED EXERCISES
INDEX

CONTENTS

1 INTRODUCTION TO BIOSTATISTICS
1.1 Introduction
1.2 Some Basic Concepts
1.3 Measurement and Measurement Scales
1.4 Sampling and Statistical Inference
1.5 The Scientific Method and the Design of Experiments
1.6 Computers and Biostatistical Analysis
1.7 Summary
Review Questions and Exercises
References

2 DESCRIPTIVE STATISTICS
2.1 Introduction
2.2 The Ordered Array
2.3 Grouped Data: The Frequency Distribution
2.4 Descriptive Statistics: Measures of Central Tendency
2.5 Descriptive Statistics: Measures of Dispersion
2.6 Summary
Review Questions and Exercises
References

3 SOME BASIC PROBABILITY CONCEPTS
3.1 Introduction
3.2 Two Views of Probability: Objective and Subjective
3.3 Elementary Properties of Probability
3.4 Calculating the Probability of an Event
3.5 Bayes’ Theorem, Screening Tests, Sensitivity, Specificity, and Predictive Value Positive and Negative
3.6 Summary
Review Questions and Exercises
References

4 PROBABILITY DISTRIBUTIONS
4.1 Introduction
4.2 Probability Distributions of Discrete Variables
4.3 The Binomial Distribution
4.4 The Poisson Distribution
4.5 Continuous Probability Distributions
4.6 The Normal Distribution
4.7 Normal Distribution Applications
4.8 Summary
Review Questions and Exercises
References

5 SOME IMPORTANT SAMPLING DISTRIBUTIONS
5.1 Introduction
5.2 Sampling Distributions
5.3 Distribution of the Sample Mean
5.4 Distribution of the Difference Between Two Sample Means
5.5 Distribution of the Sample Proportion
5.6 Distribution of the Difference Between Two Sample Proportions
5.7 Summary
Review Questions and Exercises
References

6 ESTIMATION
6.1 Introduction
6.2 Confidence Interval for a Population Mean
6.3 The t Distribution
6.4 Confidence Interval for the Difference Between Two Population Means
6.5 Confidence Interval for a Population Proportion
6.6 Confidence Interval for the Difference Between Two Population Proportions
6.7 Determination of Sample Size for Estimating Means
6.8 Determination of Sample Size for Estimating Proportions
6.9 Confidence Interval for the Variance of a Normally Distributed Population
6.10 Confidence Interval for the Ratio of the Variances of Two Normally Distributed Populations
6.11 Summary
Review Questions and Exercises
References

7 HYPOTHESIS TESTING
7.1 Introduction
7.2 Hypothesis Testing: A Single Population Mean
7.3 Hypothesis Testing: The Difference Between Two Population Means
7.4 Paired Comparisons
7.5 Hypothesis Testing: A Single Population Proportion
7.6 Hypothesis Testing: The Difference Between Two Population Proportions
7.7 Hypothesis Testing: A Single Population Variance
7.8 Hypothesis Testing: The Ratio of Two Population Variances
7.9 The Type II Error and the Power of a Test
7.10 Determining Sample Size to Control Type II Errors
7.11 Summary
Review Questions and Exercises
References

8 ANALYSIS OF VARIANCE
8.1 Introduction
8.2 The Completely Randomized Design
8.3 The Randomized Complete Block Design
8.4 The Repeated Measures Design
8.5 The Factorial Experiment
8.6 Summary
Review Questions and Exercises
References

9 SIMPLE LINEAR REGRESSION AND CORRELATION
9.1 Introduction
9.2 The Regression Model
9.3 The Sample Regression Equation
9.4 Evaluating the Regression Equation
9.5 Using the Regression Equation
9.6 The Correlation Model
9.7 The Correlation Coefficient
9.8 Some Precautions
9.9 Summary
Review Questions and Exercises
References

10 MULTIPLE REGRESSION AND CORRELATION
10.1 Introduction
10.2 The Multiple Linear Regression Model
10.3 Obtaining the Multiple Regression Equation
10.4 Evaluating the Multiple Regression Equation
10.5 Using the Multiple Regression Equation
10.6 The Multiple Correlation Model
10.7 Summary
Review Questions and Exercises
References

11 REGRESSION ANALYSIS: SOME ADDITIONAL TECHNIQUES
11.1 Introduction
11.2 Qualitative Independent Variables
11.3 Variable Selection Procedures
11.4 Logistic Regression
11.5 Summary
Review Questions and Exercises
References

12 THE CHI-SQUARE DISTRIBUTION AND THE ANALYSIS OF FREQUENCIES
12.1 Introduction
12.2 The Mathematical Properties of the Chi-Square Distribution
12.3 Tests of Goodness-of-Fit
12.4 Tests of Independence
12.5 Tests of Homogeneity
12.6 The Fisher Exact Test
12.7 Relative Risk, Odds Ratio, and the Mantel–Haenszel Statistic
12.8 Summary
Review Questions and Exercises
References

13 NONPARAMETRIC AND DISTRIBUTION-FREE STATISTICS
13.1 Introduction
13.2 Measurement Scales
13.3 The Sign Test
13.4 The Wilcoxon Signed-Rank Test for Location
13.5 The Median Test
13.6 The Mann–Whitney Test
13.7 The Kolmogorov–Smirnov Goodness-of-Fit Test
13.8 The Kruskal–Wallis One-Way Analysis of Variance by Ranks
13.9 The Friedman Two-Way Analysis of Variance by Ranks
13.10 The Spearman Rank Correlation Coefficient
13.11 Nonparametric Regression Analysis
13.12 Summary
Review Questions and Exercises
References

14 SURVIVAL ANALYSIS
14.1 Introduction
14.2 Time-to-Event Data and Censoring
14.3 The Kaplan–Meier Procedure
14.4 Comparing Survival Curves
14.5 Cox Regression: The Proportional Hazards Model
14.6 Summary
Review Questions and Exercises
References

15 VITAL STATISTICS (ONLINE)

www.wiley.com/college/daniel

15.1 Introduction

15.2 Death Rates and Ratios

15.3 Measures of Fertility

15.4 Measures of Morbidity

15.5 Summary

Review Questions and Exercises

References

APPENDIX: STATISTICAL TABLES
ANSWERS TO ODD-NUMBERED EXERCISES
INDEX

CHAPTER 1

INTRODUCTION TO BIOSTATISTICS

CHAPTER OVERVIEW

This chapter is intended to provide an overview of the basic statistical

concepts used throughout the textbook. A course in statistics requires the

student to learn many new terms and concepts. This chapter lays the foundation

necessary for understanding basic statistical terms and concepts and the

role that statisticians play in promoting scientific discovery and wisdom.

TOPICS

1.1 INTRODUCTION

1.2 SOME BASIC CONCEPTS

1.3 MEASUREMENT AND MEASUREMENT SCALES

1.4 SAMPLING AND STATISTICAL INFERENCE

1.5 THE SCIENTIFIC METHOD AND THE DESIGN OF EXPERIMENTS

1.6 COMPUTERS AND BIOSTATISTICAL ANALYSIS

1.7 SUMMARY

LEARNING OUTCOMES

After studying this chapter, the student will

1. understand the basic concepts and terminology of biostatistics, including the

various kinds of variables, measurement, and measurement scales.

2. be able to select a simple random sample and other scientific samples from a

population of subjects.

3. understand the processes involved in the scientific method and the design of

experiments.

4. appreciate the advantages of using computers in the statistical analysis of data

generated by studies and experiments conducted by researchers in the health

sciences.

1.1 INTRODUCTION

We are frequently reminded of the fact that we are living in the information age.

Appropriately, then, this book is about information—how it is obtained, how it is analyzed,

and how it is interpreted. The information about which we are concerned we call data, and

the data are available to us in the form of numbers.

The objectives of this book are twofold: (1) to teach the student to organize and

summarize data, and (2) to teach the student how to reach decisions about a large body of

data by examining only a small part of it. The concepts and methods necessary for

achieving the first objective are presented under the heading of descriptive statistics, and

the second objective is reached through the study of what is called inferential statistics.

Chapter 2 discusses descriptive statistics. Chapters 3 through 5 discuss topics that form

the foundation of statistical inference, and most of the remainder of the book deals with

inferential statistics.

Because this volume is designed for persons preparing for or already pursuing a

career in the health field, the illustrative material and exercises reflect the problems and

activities that these persons are likely to encounter in the performance of their duties.

1.2 SOME BASIC CONCEPTS

Like all fields of learning, statistics has its own vocabulary. Some of the words and phrases

encountered in the study of statistics will be new to those not previously exposed to the

subject. Other terms, though appearing to be familiar, may have specialized meanings that

are different from the meanings that we are accustomed to associating with these terms.

The following are some terms that we will use extensively in this book.

Data The raw material of statistics is data. For our purposes we may define data as

numbers. The two kinds of numbers that we use in statistics are numbers that result from

the taking—in the usual sense of the term—of a measurement, and those that result

from the process of counting. For example, when a nurse weighs a patient or takes

a patient’s temperature, a measurement, consisting of a number such as 150 pounds or

100 degrees Fahrenheit, is obtained. Quite a different type of number is obtained when a

hospital administrator counts the number of patients—perhaps 20—discharged from the

hospital on a given day. Each of the three numbers is a datum, and the three taken

together are data.

Statistics The meaning of statistics is implicit in the previous section. More

concretely, however, we may say that statistics is a field of study concerned with (1)

the collection, organization, summarization, and analysis of data; and (2) the drawing of

inferences about a body of data when only a part of the data is observed.

The person who performs these statistical activities must be prepared to interpret and

to communicate the results to someone else as the situation demands. Simply put, we may

say that data are numbers, numbers contain information, and the purpose of statistics is to

investigate and evaluate the nature and meaning of this information.

Sources of Data The performance of statistical activities is motivated by the

need to answer a question. For example, clinicians may want answers to questions

regarding the relative merits of competing treatment procedures. Administrators may

want answers to questions regarding such areas of concern as employee morale or

facility utilization. When we determine that the appropriate approach to seeking an

answer to a question will require the use of statistics, we begin to search for suitable data

to serve as the raw material for our investigation. Such data are usually available from

one or more of the following sources:

1. Routinely kept records. It is difficult to imagine any type of organization that

does not keep records of day-to-day transactions of its activities. Hospital medical

records, for example, contain immense amounts of information on patients, while

hospital accounting records contain a wealth of data on the facility’s business

activities. When the need for data arises, we should look for them first among

routinely kept records.

2. Surveys. If the data needed to answer a question are not available from routinely

kept records, the logical source may be a survey. Suppose, for example, that the

administrator of a clinic wishes to obtain information regarding the mode of

transportation used by patients to visit the clinic. If admission forms do not contain

a question on mode of transportation, we may conduct a survey among patients to

obtain this information.

3. Experiments. Frequently the data needed to answer a question are available only as

the result of an experiment. A nurse may wish to know which of several strategies is

best for maximizing patient compliance. The nurse might conduct an experiment in

which the different strategies of motivating compliance are tried with different

patients. Subsequent evaluation of the responses to the different strategies might

enable the nurse to decide which is most effective.

4. External sources. The data needed to answer a question may already exist in the

form of published reports, commercially available data banks, or the research

literature. In other words, we may find that someone else has already asked the

same question, and the answer obtained may be applicable to our present

situation.

Biostatistics The tools of statistics are employed in many fields—business,

education, psychology, agriculture, and economics, to mention only a few. When the

data analyzed are derived from the biological sciences and medicine, we use the term

biostatistics to distinguish this particular application of statistical tools and concepts. This

area of application is the concern of this book.

Variable If, as we observe a characteristic, we find that it takes on different values

in different persons, places, or things, we label the characteristic a variable. We do this

for the simple reason that the characteristic is not the same when observed in different

possessors of it. Some examples of variables include diastolic blood pressure, heart rate,

the heights of adult males, the weights of preschool children, and the ages of patients

seen in a dental clinic.

Quantitative Variables A quantitative variable is one that can be measured in

the usual sense. We can, for example, obtain measurements on the heights of adult males,

the weights of preschool children, and the ages of patients seen in a dental clinic. These are

examples of quantitative variables. Measurements made on quantitative variables convey

information regarding amount.

Qualitative Variables Some characteristics are not capable of being measured

in the sense that height, weight, and age are measured. Many characteristics can be

categorized only, as, for example, when an ill person is given a medical diagnosis, a

person is designated as belonging to an ethnic group, or a person, place, or object is

said to possess or not to possess some characteristic of interest. In such cases

measuring consists of categorizing. We refer to variables of this kind as qualitative

variables. Measurements made on qualitative variables convey information regarding

attribute.

Although, in the case of qualitative variables, measurement in the usual sense of the

word is not achieved, we can count the number of persons, places, or things belonging to

various categories. A hospital administrator, for example, can count the number of patients

admitted during a day under each of the various admitting diagnoses. These counts, or

frequencies as they are called, are the numbers that we manipulate when our analysis

involves qualitative variables.
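The counting described above can be sketched in a few lines of Python. This is only an illustration: the list of admitting diagnoses is hypothetical, and the function name `frequencies` is our own choice rather than anything defined in this book.

```python
from collections import Counter

# Hypothetical admitting diagnoses recorded for one day (illustrative values only).
admissions = ["pneumonia", "fracture", "pneumonia", "asthma", "fracture", "pneumonia"]

def frequencies(values):
    """Count how many observations fall into each category of a qualitative variable."""
    return Counter(values)

counts = frequencies(admissions)

# Print each category with its frequency, most common first.
for diagnosis, n in counts.most_common():
    print(f"{diagnosis}: {n}")
```

The frequencies always sum to the number of observations, which is a convenient check when tabulating by hand.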

Random Variable Whenever we determine the height, weight, or age of an

individual, the result is frequently referred to as a value of the respective variable.

When the values obtained arise as a result of chance factors, so that they cannot be

exactly predicted in advance, the variable is called a random variable. An example of a

random variable is adult height. When a child is born, we cannot predict exactly his or her

height at maturity. Attained adult height is the result of numerous genetic and environmental

factors. Values resulting from measurement procedures are often referred to as

observations or measurements.

Discrete Random Variable Variables may be characterized further as to

whether they are discrete or continuous. Since mathematically rigorous definitions of

discrete and continuous variables are beyond the level of this book, we offer, instead,

nonrigorous definitions and give an example of each.

A discrete variable is characterized by gaps or interruptions in the values that it can

assume. These gaps or interruptions indicate the absence of values between particular

values that the variable can assume. Some examples illustrate the point. The number of

daily admissions to a general hospital is a discrete random variable since the number of

admissions each day must be represented by a whole number, such as 0, 1, 2, or 3. The

number of admissions on a given day cannot be a number such as 1.5, 2.997, or 3.333. The

number of decayed, missing, or filled teeth per child in an elementary school is another

example of a discrete variable.
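As a small illustration of these gaps, the following Python sketch checks whether a number could be a daily admission count. The function name `is_valid_admission_count` is hypothetical, and the rule it encodes (a whole number that is not negative) is our reading of the example above.

```python
def is_valid_admission_count(value):
    """Return True if value is a whole number >= 0, i.e., a possible daily admission count."""
    return float(value).is_integer() and value >= 0

# Whole numbers are possible values; the numbers in between are not.
for candidate in [0, 1, 2, 3, 1.5, 2.997, 3.333]:
    print(candidate, is_valid_admission_count(candidate))
```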

Continuous Random Variable A continuous random variable does not

possess the gaps or interruptions characteristic of a discrete random variable. A

continuous random variable can assume any value within a specified relevant interval

of values assumed by the variable. Examples of continuous variables include the various

measurements that can be made on individuals such as height, weight, and skull

circumference. No matter how close together the observed heights of two people, for

example, we can, theoretically, find another person whose height falls somewhere in

between.

Because of the limitations of available measuring instruments, however, observations

on variables that are inherently continuous are recorded as if they were discrete.

Height, for example, is usually recorded to the nearest one-quarter, one-half, or whole

inch, whereas, with a perfect measuring device, such a measurement could be made as

precise as desired.
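The effect of instrument resolution can be sketched as follows. The function `record_height` and the example heights are hypothetical; the sketch assumes that recording means rounding to the nearest multiple of the instrument's resolution.

```python
def record_height(true_height_in, resolution_in=0.25):
    """Round a height in inches to the nearest multiple of resolution_in."""
    return round(true_height_in / resolution_in) * resolution_in

# The same kind of continuous quantity recorded at three resolutions.
print(record_height(68.37))        # nearest one-quarter inch
print(record_height(68.13, 0.5))   # nearest one-half inch
print(record_height(68.6, 1.0))    # nearest whole inch
```

Whatever the true height, the recorded value can fall only on the grid set by the resolution, which is why the recorded data look discrete.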

Population The average person thinks of a population as a collection of entities,

usually people. A population or collection of entities may, however, consist of animals,

machines, places, or cells. For our purposes, we define a population of entities as the

largest collection of entities for which we have an interest at a particular time. If we take a

measurement of some variable on each of the entities in a population, we generate a

population of values of that variable. We may, therefore, define a population of values as

the largest collection of values of a random variable for which we have an interest at a

particular time. If, for example, we are interested in the weights of all the children enrolled

in a certain county elementary school system, our population consists of all these weights.

If our interest lies only in the weights of first-grade students in the system, we have a

different population—weights of first-grade students enrolled in the school system. Hence,

populations are determined or defined by our sphere of interest. Populations may be finite

or infinite. If a population of values consists of a fixed number of these values, the

population is said to be finite. If, on the other hand, a population consists of an endless

succession of values, the population is an infinite one.

Sample A sample may be defined simply as a part of a population. Suppose our

population consists of the weights of all the elementary school children enrolled in a certain

county school system. If we collect for analysis the weights of only a fraction of these

children, we have only a part of our population of weights, that is, we have a sample.

1.3 MEASUREMENT AND MEASUREMENT SCALES

In the preceding discussion we used the word measurement several times in its usual sense,

and presumably the reader clearly understood the intended meaning. The word measurement,

however, may be given a more scientific definition. In fact, there is a whole body of

scientific literature devoted to the subject of measurement. Part of this literature is

concerned also with the nature of the numbers that result from measurements. Authorities

on the subject of measurement speak of measurement scales that result in the categorization

of measurements according to their nature. In this section we define measurement and

the four resulting measurement scales. A more detailed discussion of the subject is to be

found in the writings of Stevens (1,2).

Measurement This may be defined as the assignment of numbers to objects or

events according to a set of rules. The various measurement scales result from the fact that

measurement may be carried out under different sets of rules.

The Nominal Scale The lowest measurement scale is the nominal scale. As the

name implies it consists of “naming” observations or classifying them into various

mutually exclusive and collectively exhaustive categories. The practice of using numbers

to distinguish among the various medical diagnoses constitutes measurement on a nominal

scale. Other examples include such dichotomies as male–female, well–sick, under 65 years

of age–65 and over, child–adult, and married–not married.

The Ordinal Scale Whenever observations are not only different from category to

category but can be ranked according to some criterion, they are said to be measured on an

ordinal scale. Convalescing patients may be characterized as unimproved, improved, and

much improved. Individuals may be classified according to socioeconomic status as low,

medium, or high. The intelligence of children may be above average, average, or below

average. In each of these examples the members of any one category are all considered

equal, but the members of one category are considered lower, worse, or smaller than those

in another category, which in turn bears a similar relationship to another category. For

example, a much improved patient is in better health than one classified as improved, while

a patient who has improved is in better condition than one who has not improved. It is

usually impossible to infer that the difference between members of one category and the

next adjacent category is equal to the difference between members of that category and the

members of the next category adjacent to it. The degree of improvement between

unimproved and improved is probably not the same as that between improved and

much improved. The implication is that if a finer breakdown were made resulting in

more categories, these, too, could be ordered in a similar manner. The function of numbers

assigned to ordinal data is to order (or rank) the observations from lowest to highest and,

hence, the term ordinal.

The Interval Scale The interval scale is a more sophisticated scale than the nominal

or ordinal in that with this scale not only is it possible to order measurements, but also the

distance between any two measurements is known. We know, say, that the difference between

a measurement of 20 and a measurement of 30 is equal to the difference between

measurements of 30 and 40. The ability to do this implies the use of a unit distance and

a zero point, both of which are arbitrary. The selected zero point is not necessarily a true zero

in that it does not have to indicate a total absence of the quantity being measured. Perhaps the

best example of an interval scale is provided by the way in which temperature is usually

measured (degrees Fahrenheit or Celsius). The unit of measurement is the degree, and the

point of comparison is the arbitrarily chosen “zero degrees,” which does not indicate a lack of

heat. The interval scale, unlike the nominal and ordinal scales, is a truly quantitative scale.

The Ratio Scale The highest level of measurement is the ratio scale. This scale is

characterized by the fact that equality of ratios as well as equality of intervals may be

determined. Fundamental to the ratio scale is a true zero point. The measurement of such

familiar traits as height, weight, and length makes use of the ratio scale.
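The difference between the interval and ratio scales can be checked numerically. The following Python sketch uses the measurements 20, 30, and 40 from the discussion above, read as degrees Celsius. The weights of 40 and 80 kilograms and the variable names are assumptions chosen for illustration.

```python
def celsius_to_fahrenheit(c):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return c * 9 / 5 + 32


c_low, c_mid, c_high = 20.0, 30.0, 40.0
f_low, f_mid, f_high = (celsius_to_fahrenheit(c) for c in (c_low, c_mid, c_high))

# Interval scale: equality of intervals holds in either unit.
intervals_equal_c = (c_mid - c_low) == (c_high - c_mid)
intervals_equal_f = (f_mid - f_low) == (f_high - f_mid)

# Ratios depend on the arbitrary zero point, so they are not meaningful.
ratio_c = c_high / c_low  # 2.0
ratio_f = f_high / f_low  # 104 / 68, roughly 1.53

# Ratio scale: weight has a true zero, so the ratio survives a change of unit.
LB_PER_KG = 2.20462  # approximate conversion factor
ratio_kg = 80.0 / 40.0
ratio_lb = (80.0 * LB_PER_KG) / (40.0 * LB_PER_KG)

print(intervals_equal_c, intervals_equal_f)
print(ratio_c, round(ratio_f, 2), ratio_kg, round(ratio_lb, 2))
```

The temperature ratio changes with the unit while the weight ratio does not, which is the practical meaning of a true zero point.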

1.4 SAMPLING AND STATISTICAL INFERENCE

As noted earlier, one of the purposes of this book is to teach the concepts of statistical

inference, which we may define as follows:

DEFINITION

Statistical inference is the procedure by which we reach a conclusion

about a population on the basis of the information contained in a sample

that has been drawn from that population.

There are many kinds of samples that may be drawn from a population. Not every

kind of sample, however, can be used as a basis for making valid inferences about a

population. In general, in order to make a valid inference about a population, we need a

scientific sample from the population. There are also many kinds of scientific samples that

may be drawn from a population. The simplest of these is the simple random sample. In this

section we define a simple random sample and show you how to draw one from a

population.

If we use the letter N to designate the size of a finite population and the letter n to

designate the size of a sample, we may define a simple random sample as follows:

DEFINITION

If a sample of size n is drawn from a population of size N in such a way

that every possible sample of size n has the same chance of being selected,

the sample is called a simple random sample.

The mechanics of drawing a sample to satisfy the definition of a simple random

sample is called simple random sampling.
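The definition can be made concrete by listing every possible sample from a very small population. The sketch below assumes a hypothetical population of N = 5 labeled units and samples of size n = 2 drawn without replacement, so that a sample is an unordered set of distinct units. None of these values come from the text.

```python
from itertools import combinations
from math import comb

population = ["A", "B", "C", "D", "E"]  # hypothetical population, N = 5
n = 2                                   # sample size

# Every possible sample of size n (order ignored, no unit repeated).
all_samples = list(combinations(population, n))
num_samples = len(all_samples)

# Under simple random sampling, each possible sample is equally likely.
prob_each_sample = 1 / num_samples

# Share of the possible samples that contain unit "A".
inclusion_prob_a = sum("A" in s for s in all_samples) / num_samples

print(num_samples, comb(len(population), n))  # 10 10
print(prob_each_sample, inclusion_prob_a)     # 0.1 0.4
```

In this small case there are 10 possible samples, each with probability 1/10, and any one unit appears in 4 of them, so its chance of inclusion is n/N = 2/5.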

We will demonstrate the procedure of simple random sampling shortly, but first let us

consider the problem of whether to sample with replacement or without replacement. When

sampling with replacement is employed, every member of the population is available at

each draw. For example, suppose that we are drawing a sample from a population of former

hospital patients as part of a study of length of stay. Let us assume that the sampling

involves selecting from the shelves in the medical records department a sample of charts of

discharged patients. In sampling with replacement we would proceed as follows: select a

chart to be in the sample, record the length of stay, and return the chart to the shelf. The

chart is back in the “population” and may be drawn again on some subsequent draw, in

which case the length of stay will again be recorded. In sampling without replacement, we

would not return a drawn chart to the shelf after recording the length of stay, but would lay

it aside until the entire sample is drawn. Following this procedure, a given chart could

appear in the sample only once. In practice, sampling is almost always done without

replacement. The significance and consequences of this will be explained later, but first let

us see how one goes about selecting a simple random sample. To ensure true randomness of

selection, we will need to follow some objective procedure. We certainly will want to avoid

using our own judgment to decide which members of the population constitute a random

sample. The following example illustrates one method of selecting a simple random sample

from a population.
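Before turning to that example, the distinction between sampling with and without replacement can be sketched in Python. The shelf of 20 chart numbers, the sample size of 8, and the seed are assumptions made for illustration. `random.choices` draws with replacement and `random.sample` draws without replacement.

```python
import random

rng = random.Random(2012)       # arbitrary seed so the sketch is repeatable
chart_ids = list(range(1, 21))  # hypothetical shelf of 20 discharge charts

# With replacement: each chart goes back on the shelf after its length of
# stay is recorded, so the same chart may be drawn more than once.
with_replacement = rng.choices(chart_ids, k=8)

# Without replacement: a drawn chart is laid aside until the sample is
# complete, so no chart can appear twice.
without_replacement = rng.sample(chart_ids, k=8)

print(with_replacement)
print(without_replacement)
```

Only the second list is guaranteed to contain eight different charts. The first may or may not contain repeats on any given run.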

EXAMPLE 1.4.1

Gold et al. (A-1) studied the effectiveness on smoking cessation of bupropion SR, a

nicotine patch, or both, when co-administered with cognitive-behavioral therapy. Consecutive

consenting patients assigned themselves to one of the three treatments. For illustrative

purposes, let us consider all these subjects to be a population of size N = 189. We wish to

select a simple random sample of size 10 from this population whose ages are shown in

Table 1.4.1.

TABLE 1.4.1 Ages of 189 Subjects Who Participated in a Study on Smoking Cessation

Subject No.  Age    Subject No.  Age    Subject No.  Age    Subject No.  Age
1            48     49           38     97           51     145          52
2            35     50           44     98           50     146          53
3            46     51           43     99           50     147          61
4            44     52           47     100          55     148          60
5            43     53           46     101          63     149          53
6            42     54           57     102          50     150          53
7            39     55           52     103          59     151          50
8            44     56           54     104          54     152          53
9            49     57           56     105          60     153          54
10           49     58           53     106          50     154          61
11           44     59           64     107          56     155          61
12           39     60           53     108          68     156          61
13           38     61           58     109          66     157          64
14           49     62           54     110          71     158          53
15           49     63           59     111          82     159          53
16           53     64           56     112          68     160          54
17           56     65           62     113          78     161          61
18           57     66           50     114          66     162          60
19           51     67           64     115          70     163          51
20           61     68           53     116          66     164          50
21           53     69           61     117          78     165          53
22           66     70           53     118          69     166          64
23           71     71           62     119          71     167          64
24           75     72           57     120          69     168          53
25           72     73           52     121          78     169          60
26           65     74           54     122          66     170          54
27           67     75           61     123          68     171          55
28           38     76           59     124          71     172          58
29           37     77           57     125          69     173          62
30           46     78           52     126          77     174          62
31           44     79           54     127          76     175          54
32           44     80           53     128          71     176          53
33           48     81           62     129          43     177          61
34           49     82           52     130          47     178          54
35           30     83           62     131          48     179          51
36           45     84           57     132          37     180          62
37           47     85           59     133          40     181          57
38           45     86           59     134          42     182          50
39           48     87           56     135          38     183          64
40           47     88           57     136          49     184          63
41           47     89           53     137          43     185          65
42           44     90           59     138          46     186          71
43           48     91           61     139          34     187          71
44           43     92           55     140          46     188          73
45           45     93           61     141          46     189          66
46           40     94           56     142          48
47           48     95           52     143          47
48           49     96           54     144          43

Source: Data provided courtesy of Paul B. Gold, Ph.D.

Solution: One way of selecting a simple random sample is to use a table of random
numbers like that shown in the Appendix, Table A. As the first step, we locate
a random starting point in the table. This can be done in a number of ways,
one of which is to look away from the page while touching it with the point of
a pencil. The random starting point is the digit closest to where the pencil
touched the page. Let us assume that following this procedure led to a random
starting point in Table A at the intersection of row 21 and column 28. The
digit at this point is 5. Since we have 189 values to choose from, we can use
only the random numbers 1 through 189. It will be convenient to pick three-
digit numbers so that the numbers 001 through 189 will be the only eligible
numbers. The first three-digit number, beginning at our random starting point,
is 532, a number we cannot use. The next number (going down) is 196, which
again we cannot use. Let us move down past 196, 372, 654, and 928 until we
come to 137, a number we can use. The age of the 137th subject from Table
1.4.1 is 43, the first value in our sample. We record the random number and
the corresponding age in Table 1.4.2. We record the random number to keep
track of the random numbers selected. Since we want to sample without
replacement, we do not want to include the same individual’s age twice.
Proceeding in the manner just described leads us to the remaining nine
random numbers and their corresponding ages shown in Table 1.4.2. Notice
that when we get to the end of the column, we simply move over three digits
to 028 and proceed up the column. We could have started at the top with the
number 369.

Thus we have drawn a simple random sample of size 10 from a
population of size 189. In future discussions, whenever the term simple
random sample is used, it will be understood that the sample has been drawn
in this or an equivalent manner.

The preceding discussion of random sampling is presented because of the important

role that the sampling process plays in designing research studies and experiments. The

methodology and concepts employed in sampling processes will be described in more

detail in Section 1.5.
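The table-based procedure of Example 1.4.1 can be mimicked in a few lines of Python. This is a minimal sketch, not the method used to produce Table 1.4.2: the function name `draw_srs_from_numbers`, the seed, and the use of a computer-generated stream in place of Appendix Table A are all assumptions. The rule it applies is the one described in the example, namely to accept a three-digit number only if it lies between 001 and 189 and has not already been chosen.

```python
import random


def draw_srs_from_numbers(numbers, population_size, sample_size):
    """Accept numbers in 1..population_size, skipping repeats, until the
    requested sample size is reached (sampling without replacement)."""
    chosen = []
    for number in numbers:
        if 1 <= number <= population_size and number not in chosen:
            chosen.append(number)
        if len(chosen) == sample_size:
            break
    return chosen


# The numbers read in Example 1.4.1 up to the first acceptance:
# 532, 196, 372, 654, and 928 are ineligible, and 137 is kept.
first_pick = draw_srs_from_numbers(
    [532, 196, 372, 654, 928, 137], population_size=189, sample_size=1
)
print(first_pick)  # [137]


def three_digit_stream(rng):
    """Endless stream of numbers from 000 to 999, standing in for the table."""
    while True:
        yield rng.randint(0, 999)


rng = random.Random(1)  # arbitrary seed, for repeatability only
subject_numbers = draw_srs_from_numbers(three_digit_stream(rng), 189, 10)
print(subject_numbers)

# An equivalent shortcut: sample the subject numbers 1..189 directly.
shortcut = rng.sample(range(1, 190), k=10)
print(shortcut)
```

Rejecting ineligible numbers and repeats leaves every set of 10 subjects equally likely, which is what the definition of a simple random sample requires.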

DEFINITION

A research study is a scientific study of a phenomenon of interest.

Research studies involve designing sampling protocols, collecting and

analyzing data, and providing valid conclusions based on the results of

the analyses.

DEFINITION

Experiments are a special type of research study in which observations

are made after specific manipulations of conditions have been carried

out; they provide the foundation for scientific research.

TABLE 1.4.2 Sample of 10 Ages Drawn from the Ages in Table 1.4.1

Random Number    Sample Subject Number    Age
137              1                        43
114              2                        66
155              3                        61
183              4                        64
185              5                        65
028              6                        38
085              7                        59
181              8                        57
018              9                        57
164              10                       50

Despite the tremendous importance of random sampling in the design of research
studies and experiments, there are some occasions when random sampling may not be the
most appropriate method to use. Consequently, other sampling methods must be considered.
The intention here is not to provide a comprehensive review of sampling methods, but
rather to acquaint the student with two additional sampling methods that are employed in
the health sciences, systematic sampling and stratified random sampling. Interested readers
are referred to the books by Thompson (3) and Levy and Lemeshow (4) for detailed
overviews of various sampling methods and explanations of how sample statistics are
calculated when these methods are applied in research studies and experiments.

Systematic Sampling A sampling method that is widely used in healthcare

research is the systematic sample. Medical records, which contain raw data used in

healthcare research, are generally stored in a file system or on a computer and hence are

easy to select in a systematic way. Using systematic sampling methodology, a researcher

calculates the total number of records needed for the study or experiment at hand. A

random numbers table is then employed to select a starting point in the file system. The

record located at this starting point is called record x. A second number, determined by the

number of records desired, is selected to define the sampling interval (call this interval k).

Consequently, the data set would consist of records x, x + k, x + 2k, x + 3k, and so on, until

the necessary number of records is obtained.

EXAMPLE 1.4.2

Continuing with the study of Gold et al. (A-1) illustrated in the previous example, imagine

that we wanted a systematic sample of 10 subjects from those listed in Table 1.4.1.

Solution: To obtain a starting point, we will again use Appendix Table A. For purposes

of illustration, let us assume that the random starting point in Table A was the

intersection of row 10 and column 30. The digit is a 4 and will serve as our

starting point, x. Since we are starting at subject 4, this leaves 185 remaining

subjects (i.e., 189 - 4) from which to choose. Since we wish to select 10

subjects, one method to define the sample interval, k, would be to take

185/10 = 18.5. To ensure that there will be enough subjects, it is customary to

round this quotient down, and hence we will round the result to 18. The

resulting sample is shown in Table 1.4.3.

TABLE 1.4.3 Sample of 10 Ages Selected Using a

Systematic Sample from the Ages in Table 1.4.1

Systematically Selected Subject Number Age

4 44

22 66

40 47

58 53

76 59

94 56

112 68

130 47

148 60

166 64
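The arithmetic of Example 1.4.2 can be reproduced with a short Python sketch. The function name `systematic_sample` is an assumption. The starting point x = 4, the population size 189, and the sample size 10 are taken from the example.

```python
def systematic_sample(start, interval, sample_size):
    """Return the record numbers x, x + k, x + 2k, ... for the given size."""
    return [start + i * interval for i in range(sample_size)]


N = 189  # population size
n = 10   # desired sample size
x = 4    # random starting point from the example

# Divide the remaining records by the sample size and round down,
# so that the last selected record still falls inside the population.
k = (N - x) // n  # 185 // 10 = 18

subjects = systematic_sample(x, k, n)
print(k, subjects)
# 18 [4, 22, 40, 58, 76, 94, 112, 130, 148, 166]
```

The printed subject numbers match the first column of Table 1.4.3. Rounding the interval up to 19 instead would give a last record of 4 + 9 × 19 = 175, which still fits here, but in general rounding down is what guarantees that enough records remain.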

Stratified Random Sampling A common situation that may be encountered

in a population under study is one in which the sample units occur together in a grouped

fashion. On occasion, when the sample units are not inherently grouped, it may be possible

and desirable to group them for sampling purposes. In other words, it may be desirable to

partition a population of interest into groups, or strata, in which the sample units within a

particular stratum are more similar to each other than they are to the sample units that

compose the other strata. After the population is stratified, it is customary to take a random

sample independently from each stratum. This technique is called stratified random

sampling. The resulting sample is called a stratified random sample. Although the benefits

of stratified random sampling may not be readily observable, it is most often the case that

random samples taken within a stratum will have much less variability than a random

sample taken across all strata. This is true because sample units within each stratum tend to

have characteristics that are similar.

EXAMPLE 1.4.3

Hospital trauma centers are given ratings depending on their capabilities to treat various

traumas. In this system, a level 1 trauma center is the highest level of available trauma care

and a level 4 trauma center is the lowest level of available trauma care. Imagine that we are

interested in estimating the survival rate of trauma victims treated at hospitals within a

large metropolitan area. Suppose that the metropolitan area has a level 1, a level 2, and a

level 3 trauma center. We wish to take samples of patients from these trauma centers in such

a way that the total sample size is 30.

Solution: We assume that the survival rates of patients may depend quite significantly

on the trauma that they experienced and therefore on the level of care that

they receive. As a result, a simple random sample of all trauma patients,

without regard to the center at which they were treated, may not represent

true survival rates, since patients receive different care at the various trauma

centers. One way to better estimate the survival rate is to treat each trauma

center as a stratum and then randomly select 10 patient files from each of the

three centers. This procedure is based on the fact that we suspect that the

survival rates within the trauma centers are less variable than the survival

rates across trauma centers. Therefore, we believe that the stratified random

sample provides a better representation of survival than would a sample taken

without regard to differences within strata.

It should be noted that two slight modifications of the stratified sampling technique

are frequently employed. To illustrate, consider again the trauma center example. In the

first place, a systematic sample of patient files could have been selected from each trauma

center (stratum). Such a sample is called a stratified systematic sample.

The second modification of stratified sampling involves selecting the sample from a

given stratum in such a way that the number of sample units selected from that stratum is

proportional to the size of the population of that stratum. Suppose, in our trauma center

example that the level 1 trauma center treated 100 patients and the level 2 and level 3

trauma centers treated only 10 each. In that case, selecting a random sample of 10 from

each trauma center overrepresents the trauma centers with smaller patient loads. To avoid

this problem, we adjust the size of the sample taken from a stratum so that it is proportional

to the size of the stratum’s population. This type of sampling is called stratified sampling

proportional to size. The within-stratum samples can be either random or systematic as

described above.
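The two allocation schemes for the trauma center example can be compared in a brief Python sketch. The stratum sizes (100, 10, and 10 patients) and the total sample size of 30 come from the text. The function name `proportional_allocation`, the file labels, and the seed are assumptions made for illustration.

```python
import random


def proportional_allocation(stratum_sizes, total_sample_size):
    """Share of the total sample for each stratum, proportional to its size.
    The shares are returned unrounded."""
    population_total = sum(stratum_sizes.values())
    return {
        name: total_sample_size * size / population_total
        for name, size in stratum_sizes.items()
    }


stratum_sizes = {"level 1": 100, "level 2": 10, "level 3": 10}

# Equal allocation, as in Example 1.4.3: 10 files from each center.
equal_shares = {name: 10 for name in stratum_sizes}

# Allocation proportional to size for the same total of 30.
proportional_shares = proportional_allocation(stratum_sizes, 30)
print(proportional_shares)
# {'level 1': 25.0, 'level 2': 2.5, 'level 3': 2.5}

# Drawing the stratified random sample under equal allocation,
# using hypothetical file labels for each center.
rng = random.Random(3)  # arbitrary seed, for repeatability only
files = {
    name: [f"{name} file {i}" for i in range(1, size + 1)]
    for name, size in stratum_sizes.items()
}
stratified_sample = {
    name: rng.sample(files[name], k=equal_shares[name]) for name in files
}
print({name: len(chosen) for name, chosen in stratified_sample.items()})
```

With equal allocation, every patient of the two smaller centers is sampled while only one in ten patients of the level 1 center is. The proportional shares need not be whole numbers, and how to round them is a design decision that the text does not address.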

EXERCISES

1.4.1 Using the table of random numbers, select a new random starting point, and draw another simple

random sample of size 10 from the data in Table 1.4.1. Record the ages of the subjects in this new

sample. Save your data for future use. What is the variable of interest in this exercise? What

measurement scale was used to obtain the measurements?

1.4.2 Select another simple random sample of size 10 from the population represented in Table 1.4.1.

Compare the subjects in this sample with those in the sample drawn in Exercise 1.4.1. Are there any

subjects who showed up in both samples? How many? Compare the ages of the subjects in the two

samples. How many ages in the first sample were duplicated in the second sample?

1.4.3 Using the table of random numbers, select a random sample and a systematic sample, each of size 15,

from the data in Table 1.4.1. Visually compare the distributions of the two samples. Do they appear

similar? Which appears to be the best representation of the data?

1.4.4 Construct an example where it would be appropriate to use stratified sampling. Discuss how you

would use stratified random sampling and stratified sampling proportional to size with this example.

Which do you think would best represent the population that you described in your example? Why?

1.5 THE SCIENTIFIC METHOD AND THE DESIGN OF EXPERIMENTS

Data analyses using a broad range of statistical methods play a significant role in scientific

studies. The previous section highlighted the importance of obtaining samples in a

scientific manner. Appropriate sampling techniques enhance the likelihood that the results

of statistical analyses of a data set will provide valid and scientifically defensible results.

Because of the importance of the proper collection of data to support scientific discovery, it

is necessary to consider the foundation of such discovery—the scientific method—and to

explore the role of statistics in the context of this method.

DEFINITION

The scientific method is a process by which scientific information is

collected, analyzed, and reported in order to produce unbiased and

replicable results in an effort to provide an accurate representation of

observable phenomena.

The scientific method is recognized universally as the only truly acceptable way to

produce new scientific understanding of the world around us. It is based on an empirical

approach, in that decisions and outcomes are based on data. There are several key elements

associated with the scientific method, and the concepts and techniques of statistics play a

prominent role in all these elements.

Making an Observation First, an observation is made of a phenomenon or a

group of phenomena. This observation leads to the formulation of questions or uncertainties

that can be answered in a scientifically rigorous way. For example, it is readily

observable that regular exercise reduces body weight in many people. It is also readily

observable that changing diet may have a similar effect. In this case there are two

observable phenomena, regular exercise and diet change, that have the same endpoint.

The nature of this endpoint can be determined by use of the scientific method.

Formulating a Hypothesis In the second step of the scientific method a

hypothesis is formulated to explain the observation and to make quantitative predictions

of new observations. Often hypotheses are generated as a result of extensive background

research and literature reviews. The objective is to produce hypotheses that are scientifically

sound. Hypotheses may be stated as either research hypotheses or statistical

hypotheses. Explicit definitions of these terms are given in Chapter 7, which discusses

the science of testing hypotheses. Suffice it to say for now that a research hypothesis from

the weight-loss example would be a statement such as, “Exercise appears to reduce body

weight.” There is certainly nothing incorrect about this conjecture, but it lacks a truly

quantitative basis for testing. A statistical hypothesis may be stated using quantitative

terminology as follows: “The average (mean) loss of body weight of people who exercise is

greater than the average (mean) loss of body weight of people who do not exercise.” In this

statement a quantitative measure, the “average” or “mean” value, is hypothesized to be

greater in the sample of patients who exercise. The role of the statistician in this step of the

scientific method is to state the hypothesis in a way that valid conclusions may be drawn

and to interpret correctly the results of such conclusions.
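In symbols, the statistical hypothesis quoted above might be written as follows. The notation is an assumption introduced here for illustration: $\mu_E$ and $\mu_N$ denote the mean loss of body weight among people who exercise and people who do not exercise, respectively. The formal treatment is deferred to Chapter 7.

```latex
\[
  \mu_E > \mu_N
  \qquad \text{or, equivalently,} \qquad
  \mu_E - \mu_N > 0 .
\]
```

Written this way, the hypothesis names a specific quantity, the difference between two means, that data from a properly designed study can be used to evaluate.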

Designing an Experiment The third step of the scientific method involves

designing an experiment that will yield the data necessary to validly test an appropriate

statistical hypothesis. This step of the scientific method, like that of data analysis, requires

the expertise of a statistician. Improperly designed experiments are the leading cause of

invalid results and unjustified conclusions. Further, most studies that are challenged by

experts are challenged on the basis of the appropriateness or inappropriateness of the

study’s research design.

Those who properly design research experiments make every effort to ensure that the

measurement of the phenomenon of interest is both accurate and precise. Accuracy refers

to the correctness of a measurement. Precision, on the other hand, refers to the consistency

of a measurement. It should be noted that in the social sciences, the term validity is

sometimes used to mean accuracy and that reliability is sometimes used to mean precision.

In the context of the weight-loss example given earlier, the scale used to measure the weight

of study participants would be accurate if the measurement is validated using a scale that is

properly calibrated. If, however, the scale is off by +3 pounds, then each participant’s

recorded weight would be 3 pounds too heavy; the measurements would be precise in that each would

be wrong by +3 pounds, but the measurements would not be accurate. Measurements that

are inaccurate or imprecise may invalidate research findings.
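The distinction can be quantified with a small Python sketch. The true weight of 150 pounds and both sets of readings are hypothetical values invented for illustration. The first scale reproduces the +3 pound error described above, and the second is an added contrast that does not appear in the text.

```python
from statistics import mean, pstdev

true_weight = 150.0  # hypothetical true weight in pounds

# Scale A reads 3 pounds heavy every time: precise but not accurate.
scale_a = [153.0, 153.0, 153.0, 153.0]

# Scale B scatters around the true weight: accurate on average, not precise.
scale_b = [147.0, 153.0, 149.0, 151.0]

bias_a = mean(scale_a) - true_weight  # systematic error
spread_a = pstdev(scale_a)            # consistency of the readings
bias_b = mean(scale_b) - true_weight
spread_b = pstdev(scale_b)

print(bias_a, spread_a)            # 3.0 0.0
print(bias_b, round(spread_b, 3))  # 0.0 2.236
```

Here the average error serves as a rough measure of accuracy and the standard deviation of repeated readings as a rough measure of precision. A measurement process should perform well on both.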

The design of an experiment depends on the type of data that need to be collected to

test a specific hypothesis. As discussed in Section 1.2, data may be collected or made

available through a variety of means. For much scientific research, however, the standard

for data collection is experimentation. A true experimental design is one in which study

subjects are randomly assigned to an experimental group (or treatment group) and a control

group that is not directly exposed to a treatment. Continuing the weight-loss example, a

sample of 100 participants could be randomly assigned to two conditions using the

methods of Section 1.4. A sample of 50 of the participants would be assigned to a specific

exercise program and the remaining 50 would be monitored, but asked not to exercise for a

specific period of time. At the end of this experiment the average (mean) weight losses of

the two groups could be compared. The reason that experimental designs are desirable

is that if all other potential factors are controlled, a cause–effect relationship may be tested;

that is, all else being equal, we would be able to conclude or fail to conclude that the

experimental group lost weight as a result of exercising.
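Random assignment of the 100 participants to the two conditions could be carried out as in the following Python sketch. The participant numbers 1 through 100 and the seed are assumptions. Shuffling the full list and splitting it in half is one of several equivalent ways to make the assignment.

```python
import random

rng = random.Random(50)              # arbitrary seed, for repeatability only
participants = list(range(1, 101))   # hypothetical participant numbers

# Shuffle a copy of the list, then split it into two groups of 50.
shuffled = participants[:]
rng.shuffle(shuffled)
exercise_group = sorted(shuffled[:50])
control_group = sorted(shuffled[50:])

print(len(exercise_group), len(control_group))  # 50 50
print(exercise_group[:5], control_group[:5])
```

Because chance alone decides who exercises, other factors that influence weight loss tend to be balanced between the two groups, which is what supports a cause–effect interpretation.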

The potential complexity of research designs requires statistical expertise, and

Chapter 8 highlights some commonly used experimental designs. For a more in-depth

discussion of research designs, the interested reader may wish to refer to texts by Kuehl (5),

Keppel and Wickens (6), and Tabachnick and Fidell (7).

Conclusion In the execution of a research study or experiment, one would hope to

have collected the data necessary to draw conclusions, with some degree of confidence,

about the hypotheses that were posed as part of the design. It is often the case that

hypotheses need to be modified and retested with new data and a different design.

Whatever the conclusions of the scientific process, however, results are rarely considered

to be conclusive. That is, results need to be replicated, often a large number of times, before

scientific credence is granted them.

EXERCISES

1.5.1 Using the example of weight loss as an endpoint, discuss how you would use the scientific method to

test the observation that change in diet is related to weight loss. Include all of the steps, including the

hypothesis to be tested and the design of your experiment.

1.5.2 Continuing with Exercise 1.5.1, consider how you would use the scientific method to test the

observation that both exercise and change in diet are related to weight loss. Include all of the steps,

paying particular attention to how you might design the experiment and which hypotheses would be

testable given your design.

1.6 COMPUTERS AND BIOSTATISTICAL ANALYSIS

The widespread use of computers has had a tremendous impact on health sciences research

in general and biostatistical analysis in particular. The necessity to perform long and

tedious arithmetic computations as part of the statistical analysis of data lives only in the

memory of those researchers and practitioners whose careers antedate the so-called

computer revolution. Computers can perform more calculations faster and far more

accurately than can human technicians. The use of computers makes it possible for

investigators to devote more time to the improvement of the quality of raw data and the

interpretation of the results.

The current prevalence of microcomputers and the abundance of available statistical

software programs have further revolutionized statistical computing. The reader in search

of a statistical software package may wish to consult The American Statistician, a quarterly

publication of the American Statistical Association. Statistical software packages are

regularly reviewed and advertised in the periodical.

Computers currently on the market are equipped with random number generating

capabilities. As an alternative to using printed tables of random numbers, investigators may

use computers to generate the random numbers they need. Actually, the “random” numbers

generated by most computers are in reality pseudorandom numbers because they are the

result of a deterministic formula. However, as Fishman (8) points out, the numbers appear

to serve satisfactorily for many practical purposes.
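The deterministic character of pseudorandom numbers can be seen in a toy generator. The sketch below implements a linear congruential generator, one simple kind of deterministic formula. The constants and the seed are assumptions chosen for illustration, and this is not the algorithm used by any of the statistical packages named in this chapter.

```python
def lcg(seed, a=1103515245, c=12345, m=2**31):
    """Toy linear congruential generator: each value is computed from the
    previous one by the fixed formula (a * state + c) mod m."""
    state = seed
    while True:
        state = (a * state + c) % m
        yield state


def first_values(seed, count):
    """Return the first `count` values produced from a given seed."""
    generator = lcg(seed)
    return [next(generator) for _ in range(count)]


run_1 = first_values(seed=2012, count=5)
run_2 = first_values(seed=2012, count=5)

print(run_1)
print(run_1 == run_2)  # True: the same seed always gives the same sequence
```

Reproducibility of this kind is useful in practice, since recording the seed allows a sample drawn by computer to be redrawn and checked later.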

The usefulness of the computer in the health sciences is not limited to statistical

analysis. The reader interested in learning more about the use of computers in the health

sciences will find the books by Hersh (9), Johns (10), Miller et al. (11), and Saba and

McCormick (12) helpful. Those who wish to derive maximum benefit from the Internet may

wish to consult the books Physicians’ Guide to the Internet (13) and Computers in

Nursing’s Nurses’ Guide to the Internet (14). Current developments in the use of computers

in biology, medicine, and related fields are reported in several periodicals devoted to

the subject. A few such periodicals are Computers in Biology and Medicine, Computers

and Biomedical Research, International Journal of Bio-Medical Computing, Computer

Methods and Programs in Biomedicine, Computer Applications in the Biosciences, and

Computers in Nursing.

Computer printouts are used throughout this book to illustrate the use of computers in

biostatistical analysis. The MINITAB, SPSS, R, and SAS® statistical software packages for

the personal computer have been used for this purpose.

1.7 SUMMARY

In this chapter we introduced the reader to the basic concepts of statistics. We defined

statistics as an area of study concerned with collecting and describing data and with making

statistical inferences. We defined statistical inference as the procedure by which we reach a

conclusion about a population on the basis of information contained in a sample drawn

from that population. We learned that a basic type of sample that will allow us to make valid

inferences is the simple random sample. We learned how to use a table of random numbers

to draw a simple random sample from a population.

The reader is provided with the definitions of some basic terms, such as variable

and sample, that are used in the study of statistics. We also discussed measurement and

defined four measurement scales—nominal, ordinal, interval, and ratio. The reader is

also introduced to the scientific method and the role of statistics and the statistician in

this process.

Finally, we discussed the importance of computers in the performance of the

activities involved in statistics.

REVIEW QUESTIONS AND EXERCISES

1. Explain what is meant by descriptive statistics.

2. Explain what is meant by inferential statistics.

3. Define:

(a) Statistics (b) Biostatistics

(c) Variable (d) Quantitative variable

(e) Qualitative variable (f) Random variable

(g) Population (h) Finite population

(i) Infinite population (j) Sample

(k) Discrete variable (l) Continuous variable

(m) Simple random sample (n) Sampling with replacement

(o) Sampling without replacement

4. Define the word measurement.

5. List, describe, and compare the four measurement scales.

6. For each of the following variables, indicate whether it is quantitative or qualitative and specify the

measurement scale that is employed when taking measurements on each:

(a) Class standing of the members of this class relative to each other

(b) Admitting diagnosis of patients admitted to a mental health clinic

(c) Weights of babies born in a hospital during a year

(d) Gender of babies born in a hospital during a year

(e) Range of motion of elbow joint of students enrolled in a university health sciences curriculum

(f) Under-arm temperature of day-old infants born in a hospital

7. For each of the following situations, answer questions a through e:

(a) What is the sample in the study?

(b) What is the population?

(c) What is the variable of interest?

(d) How many measurements were used in calculating the reported results?

(e) What measurement scale was used?

Situation A. A study of 300 households in a small southern town revealed that 20 percent had at least

one school-age child present.

Situation B. A study of 250 patients admitted to a hospital during the past year revealed that, on the

average, the patients lived 15 miles from the hospital.

8. Consider the two situations given in Exercise 7. For Situation A describe how you would use a

stratified random sample to collect the data. For Situation B describe how you would use systematic

sampling of patient records to collect the data.

REFERENCES

Methodology References

1. S. S. STEVENS, “On the Theory of Scales of Measurement,” Science, 103 (1946), 677–680.

2. S. S. STEVENS, “Mathematics, Measurement and Psychophysics,” in S. S. Stevens (ed.), Handbook of Experimental

Psychology, Wiley, New York, 1951.

3. STEVEN K. THOMPSON, Sampling (2nd ed.), Wiley, New York, 2002.

4. PAUL S. LEVY and STANLEY LEMESHOW, Sampling of Populations: Methods and Applications (3rd ed.), Wiley,

New York, 1999.

5. ROBERT O. KUEHL, Statistical Principles of Research Design and Analysis (2nd ed.), Duxbury Press, Belmont, CA,

1999.

6. GEOFFREY KEPPEL and THOMAS D. WICKENS, Design and Analysis: A Researcher’s Handbook (4th ed.), Prentice

Hall, Upper Saddle River, NJ, 2004.

7. BARBARA G. TABACHNICK and LINDA S. FIDELL, Experimental Designs using ANOVA, Thomson, Belmont, CA, 2007.

8. GEORGE S. FISHMAN, Concepts and Methods in Discrete Event Digital Simulation, Wiley, New York, 1973.

9. WILLIAM R. HERSH, Information Retrieval: A Health Care Perspective, Springer, New York, 1996.

10. MERIDA L. JOHNS, Information Management for Health Professions, Delmar Publishers, Albany, NY, 1997.

11. MARVIN J. MILLER, KENRIC W. HAMMOND, and MATTHEW G. HILE (eds.), Mental Health Computing, Springer,

New York, 1996.

12. VIRGINIA K. SABA and KATHLEEN A. MCCORMICK, Essentials of Computers for Nurses, McGraw-Hill, New York,

1996.

13. LEE HANCOCK, Physicians’ Guide to the Internet, Lippincott Williams & Wilkins Publishers, Philadelphia, 1996.

14. LESLIE H. NICOLL and TEENA H. OUELLETTE, Computers in Nursing’s Nurses’ Guide to the Internet, 3rd ed.,

Lippincott Williams & Wilkins Publishers, Philadelphia, 2001.

Applications References

A-1. PAUL B. GOLD, ROBERT N. RUBEY, and RICHARD T. HARVEY, “Naturalistic, Self-Assignment Comparative Trial

of Bupropion SR, a Nicotine Patch, or Both for Smoking Cessation Treatment in Primary Care,” American Journal

on Addictions, 11 (2002), 315–331.

CHAPTER 2

DESCRIPTIVE STATISTICS

CHAPTER OVERVIEW

This chapter introduces a set of basic procedures and statistical measures for

describing data. Data generally consist of an extensive number of measurements

or observations that are too numerous or complicated to be understood

through simple observation. Therefore, this chapter introduces several techniques,

including the construction of tables, graphical displays, and basic

statistical computations, that provide ways to condense and organize information

into a set of descriptive measures and visual devices that enhance the

understanding of complex data.

TOPICS

2.1 INTRODUCTION

2.2 THE ORDERED ARRAY

2.3 GROUPED DATA: THE FREQUENCY DISTRIBUTION

2.4 DESCRIPTIVE STATISTICS: MEASURES OF CENTRAL TENDENCY

2.5 DESCRIPTIVE STATISTICS: MEASURES OF DISPERSION

2.6 SUMMARY

LEARNING OUTCOMES

After studying this chapter, the student will

1. understand how data can be appropriately organized and displayed.

2. understand how to reduce data sets into a few useful, descriptive measures.

3. be able to calculate and interpret measures of central tendency, such as the mean,

median, and mode.

4. be able to calculate and interpret measures of dispersion, such as the range,

variance, and standard deviation.

2.1 INTRODUCTION

In Chapter 1 we stated that the taking of a measurement and the process of counting yield

numbers that contain information. The objective of the person applying the tools of

statistics to these numbers is to determine the nature of this information. This task is made

much easier if the numbers are organized and summarized. When measurements of a

random variable are taken on the entities of a population or sample, the resulting values are

made available to the researcher or statistician as a mass of unordered data. Measurements

that have not been organized, summarized, or otherwise manipulated are called raw data.

Unless the number of observations is extremely small, it is unlikely that these raw data

will impart much information until they have been put into some kind of order.

In this chapter we learn several techniques for organizing and summarizing data so

that we may more easily determine what information they contain. The ultimate in

summarization of data is the calculation of a single number that in some way conveys

important information about the data from which it was calculated. Such single numbers

that are used to describe data are called descriptive measures. After studying this chapter

you will be able to compute several descriptive measures for both populations and samples

of data.

The purpose of this chapter is to equip you with skills that will enable you to

manipulate the information—in the form of numbers—that you encounter as a health

sciences professional. The better able you are to manipulate such information, the better

understanding you will have of the environment and forces that generate the information.

2.2 THE ORDERED ARRAY

A first step in organizing data is the preparation of an ordered array. An ordered array is a

listing of the values of a collection (either population or sample) in order of magnitude from

the smallest value to the largest value. If the number of measurements to be ordered is of

any appreciable size, the use of a computer to prepare the ordered array is highly desirable.

An ordered array enables one to determine quickly the value of the smallest

measurement, the value of the largest measurement, and other facts about the arrayed

data that might be needed in a hurry. We illustrate the construction of an ordered array with

the data discussed in Example 1.4.1.

EXAMPLE 2.2.1

Table 1.4.1 contains a list of the ages of subjects who participated in the study on smoking

cessation discussed in Example 1.4.1. As can be seen, this unordered table requires

considerable searching for us to ascertain such elementary information as the age of the

youngest and oldest subjects.

Solution: Table 2.2.1 presents the data of Table 1.4.1 in the form of an ordered array. By

referring to Table 2.2.1 we are able to determine quickly the age of the

youngest subject (30) and the age of the oldest subject (82). We also readily

note that about one-third of the subjects are 50 years of age or younger.

Computer Analysis If additional computations and organization of a data set

have to be done by hand, the work may be facilitated by working from an ordered array. If

the data are to be analyzed by a computer, it may be undesirable to prepare an ordered array,

unless one is needed for reference purposes or for some other use. A computer does not

need for its user to first construct an ordered array before entering data for the construction

of frequency distributions and the performance of other analyses. However, almost all

computer statistical packages and spreadsheet programs contain a routine for sorting data

in either an ascending or descending order. See Figure 2.2.1, for example.
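As a small illustration of such a sorting routine, the Python sketch below builds an ordered array with the built-in sorted function. The ten ages are hypothetical values chosen for illustration, not the study data. In practice the full list from Table 1.4.1 would be used.

```python
# Hypothetical unordered ages (illustration only).
ages = [48, 35, 61, 30, 53, 82, 44, 53, 71, 39]

ordered_array = sorted(ages)                     # ascending order
descending_array = sorted(ages, reverse=True)    # descending order

# The smallest and largest values can be read from the ends of the ordered array.
youngest, oldest = ordered_array[0], ordered_array[-1]
print(ordered_array)
print(youngest, oldest)
```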

TABLE 2.2.1 Ordered Array of Ages of Subjects from Table 1.4.1

30 34 35 37 37 38 38 38 38 39 39 40 40 42 42

43 43 43 43 43 43 44 44 44 44 44 44 44 45 45

45 46 46 46 46 46 46 47 47 47 47 47 47 48 48

48 48 48 48 48 49 49 49 49 49 49 49 50 50 50

50 50 50 50 50 51 51 51 51 52 52 52 52 52 52

53 53 53 53 53 53 53 53 53 53 53 53 53 53 53

53 53 54 54 54 54 54 54 54 54 54 54 54 55 55

55 56 56 56 56 56 56 57 57 57 57 57 57 57 58

58 59 59 59 59 59 59 60 60 60 60 61 61 61 61

61 61 61 61 61 61 61 62 62 62 62 62 62 62 63

63 64 64 64 64 64 64 65 65 66 66 66 66 66 66

67 68 68 68 69 69 69 70 71 71 71 71 71 71 71

72 73 75 76 77 78 78 78 82

Dialog box:
Data > Sort

Session command:
MTB > Sort C1 C2;
SUBC> By C1.

FIGURE 2.2.1 MINITAB dialog box for Example 2.2.1.

2.3 GROUPED DATA: THE FREQUENCY DISTRIBUTION

Although a set of observations can be made more comprehensible and meaningful by

means of an ordered array, further useful summarization may be achieved by grouping the

data. Before the days of computers one of the main objectives in grouping large data sets

was to facilitate the calculation of various descriptive measures such as percentages and

averages. Because computers can perform these calculations on large data sets without first

grouping the data, the main purpose in grouping data now is summarization. One must bear

in mind that data contain information and that summarization is a way of making it easier to

determine the nature of this information. One must also be aware that reducing a large

quantity of information in order to summarize the data succinctly carries with it the

potential to inadvertently lose some amount of specificity with regard to the underlying

data set. Therefore, it is important to group the data sufficiently such that the vast amounts

of information are reduced into understandable summaries. At the same time, data should

not be summarized to the extent that useful intricacies in the data are obscured.

To group a set of observations we select a set of contiguous, nonoverlapping intervals

such that each value in the set of observations can be placed in one, and only one, of the

intervals. These intervals are usually referred to as class intervals.

One of the first considerations when data are to be grouped is how many intervals to

include. Too few intervals are undesirable because of the resulting loss of information. On

the other hand, if too many intervals are used, the objective of summarization will not be

met. The best guide to this, as well as to other decisions to be made in grouping data, is your

knowledge of the data. It may be that class intervals have been determined by precedent, as

in the case of annual tabulations, when the class intervals of previous years are maintained

for comparative purposes. A commonly followed rule of thumb states that there should be

no fewer than five intervals and no more than 15. If there are fewer than five intervals, the

data have been summarized too much and the information they contain has been lost. If

there are more than 15 intervals, the data have not been summarized enough.

Those who need more specific guidance in the matter of deciding how many class

intervals to employ may use a formula given by Sturges (1). This formula gives

$k = 1 + 3.322(\log_{10} n)$, where k stands for the number of class intervals and n is the

number of values in the data set under consideration. The answer obtained by applying

Sturges’s rule should not be regarded as final, but should be considered as a guide only. The

number of class intervals specified by the rule should be increased or decreased for

convenience and clear presentation.

Suppose, for example, that we have a sample of 275 observations that we want to

group. The logarithm to the base 10 of 275 is 2.4393. Applying Sturges’s formula gives

$k = 1 + 3.322(2.4393) \approx 9$. In practice, other considerations might cause us to use eight

or fewer or perhaps 10 or more class intervals.
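The calculation above can be sketched in Python as follows. The function name sturges_class_count is our own choice for illustration. The result is only a guide, so it is rounded to a whole number at the end rather than treated as exact.

```python
import math

def sturges_class_count(n):
    """Suggested number of class intervals: k = 1 + 3.322 * log10(n)."""
    return 1 + 3.322 * math.log10(n)

k = sturges_class_count(275)
print(round(k, 2))   # unrounded guide value
print(round(k))      # nearest whole number of class intervals
```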

Another question that must be decided regards the width of the class intervals. Class

intervals generally should be of the same width, although this is sometimes impossible to

accomplish. This width may be determined by dividing the range by k, the number of class

intervals. Symbolically, the class interval width is given by

$$w = \frac{R}{k} \qquad (2.3.1)$$

where R (the range) is the difference between the smallest and the largest observation in the

data set, and k is defined as above. As a rule this procedure yields a width that is

inconvenient for use. Again, we may exercise our good judgment and select a width

(usually close to one given by Equation 2.3.1) that is more convenient.

There are other rules of thumb that are helpful in setting up useful class intervals.

When the nature of the data makes them appropriate, class interval widths of 5 units, 10

units, and widths that are multiples of 10 tend to make the summarization more

comprehensible. When these widths are employed it is generally good practice to have

the lower limit of each interval end in a zero or 5. Usually class intervals are ordered from

smallest to largest; that is, the first class interval contains the smaller measurements and the

last class interval contains the larger measurements. When this is the case, the lower limit

of the first class interval should be equal to or smaller than the smallest measurement in the

data set, and the upper limit of the last class interval should be equal to or greater than the

largest measurement.
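These rules of thumb can be sketched in Python under a few stated assumptions: the data are whole numbers, the chosen width is a convenient value such as 5 or 10, and the first lower limit is taken as the largest multiple of the width that does not exceed the smallest observation. The helper name class_intervals is hypothetical. The values 30 and 82 are the smallest and largest ages in Table 2.2.1, and k = 9 is the value Sturges's rule suggests for 189 observations.

```python
import math

def class_intervals(smallest, largest, width):
    """Build equal-width class limits whose lower limits are multiples of width."""
    lower = math.floor(smallest / width) * width   # at or below the smallest value
    intervals = []
    while lower <= largest:
        intervals.append((lower, lower + width - 1))   # limits for whole-number data
        lower += width
    return intervals

smallest, largest, k = 30, 82, 9
raw_width = (largest - smallest) / k     # Equation 2.3.1: w = R / k
print(round(raw_width, 3))               # an inconvenient width
print(class_intervals(smallest, largest, width=10))
```

With a width of 10 the sketch yields six intervals, from 30 through 39 up to 80 through 89, so the last upper limit exceeds the largest observation as the rule requires.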

Most statistical packages allow users to interactively change the number of class

intervals and/or the class widths, so that several visualizations of the data can be obtained

quickly. This feature allows users to exercise their judgment in deciding which data display

is most appropriate for a given purpose. Let us use the 189 ages shown in Table 1.4.1 and

arrayed in Table 2.2.1 to illustrate the construction of a frequency distribution.

EXAMPLE 2.3.1

We wish to know how many class intervals to have in the frequency distribution of the data.

We also want to know how wide the intervals should be.

Solution: To get an idea as to the number of class intervals to use, we can apply

Sturges’s rule to obtain

$$k = 1 + 3.322(\log_{10} 189) = 1 + 3.322(2.2764618) \approx 9$$

Now let us divide the range by 9 to get some idea about the class

interval width. We have

$$\frac{R}{k} = \frac{82 - 30}{9} = \frac{52}{9} = 5.778$$

It is apparent that a class interval width of 5 or 10 will be more

convenient to use, as well as more meaningful to the reader. Suppose we

decide on 10. We may now construct our intervals. Since the smallest value in

Table 2.2.1 is 30 and the largest value is 82, we may begin our intervals with

30 and end with 89. This gives the following intervals:

30–39

40–49

50–59

60–69

70–79

80–89

We see that there are six of these intervals, three fewer than the number

suggested by Sturges’s rule.

It is sometimes useful to refer to the center, called the midpoint, of a

class interval. The midpoint of a class interval is determined by obtaining the

sum of the upper and lower limits of the class interval and dividing by 2.

Thus, for example, the midpoint of the class interval 30–39 is found to be

$(30 + 39)/2 = 34.5$.

When we group data manually, determining the number of values falling into each

class interval is merely a matter of looking at the ordered array and counting the number

of observations falling in the various intervals. When we do this for our example, we

have Table 2.3.1.

A table such as Table 2.3.1 is called a frequency distribution. This table shows the

way in which the values of the variable are distributed among the specified class intervals.

By consulting it, we can determine the frequency of occurrence of values within any one of

the class intervals shown.
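The counting step can also be delegated to a computer. The Python sketch below is one possible approach, not the procedure used by any particular statistical package. The dictionary AGE_COUNTS is our own compact tally of the ordered array in Table 2.2.1 (each age mapped to the number of subjects of that age), and the function name frequency_distribution is hypothetical.

```python
from collections import Counter

# Tally of Table 2.2.1: age -> number of subjects with that age (189 in all).
AGE_COUNTS = {
    30: 1, 34: 1, 35: 1, 37: 2, 38: 4, 39: 2, 40: 2, 42: 2, 43: 6, 44: 7,
    45: 3, 46: 6, 47: 6, 48: 7, 49: 7, 50: 8, 51: 4, 52: 6, 53: 17, 54: 11,
    55: 3, 56: 6, 57: 7, 58: 2, 59: 6, 60: 4, 61: 11, 62: 7, 63: 2, 64: 6,
    65: 2, 66: 6, 67: 1, 68: 3, 69: 3, 70: 1, 71: 7, 72: 1, 73: 1, 75: 1,
    76: 1, 77: 1, 78: 3, 82: 1,
}
ages = [age for age, count in AGE_COUNTS.items() for _ in range(count)]

def frequency_distribution(values, first_lower, width):
    """Count whole-number values per class interval: {(lower, upper): frequency}."""
    counts = Counter((v - first_lower) // width for v in values)
    return {
        (first_lower + i * width, first_lower + (i + 1) * width - 1): counts[i]
        for i in range(max(counts) + 1)
    }

table = frequency_distribution(ages, first_lower=30, width=10)
for (lower, upper), freq in table.items():
    print(f"{lower}-{upper}  {freq}")
print("Total", sum(table.values()))
```

The printed frequencies agree with Table 2.3.1.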

Relative Frequencies It may be useful at times to know the proportion, rather

than the number, of values falling within a particular class interval. We obtain this

information by dividing the number of values in the particular class interval by the total

number of values. If, in our example, we wish to know the proportion of values between 50

and 59, inclusive, we divide 70 by 189, obtaining .3704. Thus we say that 70 out of 189, or

70/189ths, or .3704, of the values are between 50 and 59. Multiplying .3704 by 100 gives us

the percentage of values between 50 and 59. We can say, then, that 37.04 percent of the

subjects are between 50 and 59 years of age. We may refer to the proportion of values

falling within a class interval as the relative frequency of occurrence of values in that

interval. In Section 3.2 we shall see that a relative frequency may be interpreted also as the

probability of occurrence within the given interval. This probability of occurrence is also

called the experimental probability or the empirical probability.
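A minimal Python sketch of the relative frequency calculation follows, starting from the frequencies of Table 2.3.1. The variable names are ours, and the interval labels are written with plain hyphens.

```python
# Frequencies from Table 2.3.1.
frequencies = {"30-39": 11, "40-49": 46, "50-59": 70,
               "60-69": 45, "70-79": 16, "80-89": 1}
total = sum(frequencies.values())

# Relative frequency = class frequency divided by the total number of values.
relative = {interval: f / total for interval, f in frequencies.items()}

print(round(relative["50-59"], 4))         # proportion of ages between 50 and 59
print(round(100 * relative["50-59"], 2))   # the same value as a percentage
```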

TABLE 2.3.1 Frequency Distribution of Ages of 189 Subjects Shown in Tables 1.4.1 and 2.2.1

Class Interval   Frequency
30–39            11
40–49            46
50–59            70
60–69            45
70–79            16
80–89            1
Total            189

In determining the frequency of values falling within two or more class intervals, we

obtain the sum of the number of values falling within the class intervals of interest.

Similarly, if we want to know the relative frequency of occurrence of values falling within

two or more class intervals, we add the respective relative frequencies. We may sum, or

cumulate, the frequencies and relative frequencies to facilitate obtaining information

regarding the frequency or relative frequency of values within two or more contiguous

class intervals. Table 2.3.2 shows the data of Table 2.3.1 along with the cumulative

frequencies, the relative frequencies, and cumulative relative frequencies.

Suppose that we are interested in the relative frequency of values between 50 and 79.

We use the cumulative relative frequency column of Table 2.3.2 and subtract .3016 from

.9948, obtaining .6932.
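The cumulative columns and the subtraction just described can be sketched in Python as follows. This is an illustration with our own variable names, and it works with unrounded proportions rather than the four-decimal table entries.

```python
from itertools import accumulate

intervals = ["30-39", "40-49", "50-59", "60-69", "70-79", "80-89"]
frequencies = [11, 46, 70, 45, 16, 1]          # Table 2.3.1
total = sum(frequencies)

cumulative = list(accumulate(frequencies))     # running totals: 11, 57, 127, ...
cumulative_relative = [c / total for c in cumulative]

# Relative frequency of ages 50 through 79: cumulative value through 70-79
# minus cumulative value through 40-49.
between_50_and_79 = cumulative_relative[4] - cumulative_relative[1]
print(cumulative)
print(round(between_50_and_79, 4))
```

The unrounded result is 131/189, which rounds to .6931. It differs in the last digit from the .6932 obtained by subtracting the rounded table entries, a small reminder that rounding error can accumulate.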

We may use a statistical package to obtain a table similar to that shown in Table 2.3.2.

Tables obtained from both MINITAB and SPSS software are shown in Figure 2.3.1.

The Histogram We may display a frequency distribution (or a relative frequency

distribution) graphically in the form of a histogram, which is a special type of bar graph.

When we construct a histogram the values of the variable under consideration are

represented by the horizontal axis, while the vertical axis has as its scale the frequency (or

relative frequency if desired) of occurrence. Above each class interval on the horizontal

axis a rectangular bar, or cell, as it is sometimes called, is erected so that the height

corresponds to the respective frequency when the class intervals are of equal width. The

cells of a histogram must be joined and, to accomplish this, we must take into account the

true boundaries of the class intervals to prevent gaps from occurring between the cells of

our graph.

The level of precision observed in reported data that are measured on a continuous

scale indicates some order of rounding. The order of rounding reflects either the reporter’s

personal preference or the limitations of the measuring instrument employed. When a

frequency distribution is constructed from the data, the class interval limits usually reflect

the degree of precision of the raw data. This has been done in our illustrative example.

TABLE 2.3.2 Frequency, Cumulative Frequency, Relative Frequency, and Cumulative Relative Frequency Distributions of the Ages of Subjects Described in Example 1.4.1

Class Interval   Frequency   Cumulative Frequency   Relative Frequency   Cumulative Relative Frequency
30–39            11          11                     .0582                .0582
40–49            46          57                     .2434                .3016
50–59            70          127                    .3704                .6720
60–69            45          172                    .2381                .9101
70–79            16          188                    .0847                .9948
80–89            1           189                    .0053                1.0001
Total            189                                1.0001

Note: Frequencies do not add to 1.0000 exactly because of rounding.

We know, however, that some of the values falling in the second class interval, for example,

when measured precisely, would probably be a little less than 40 and some would be a little

greater than 49. Considering the underlying continuity of our variable, and assuming that

the data were rounded to the nearest whole number, we find it convenient to think of 39.5

and 49.5 as the true limits of this second interval. The true limits for each of the class

intervals, then, we take to be as shown in Table 2.3.3.
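The adjustment from reported limits to true limits can be sketched in Python, assuming, as in the text, that the data were rounded to the nearest whole number (a rounding unit of 1). The function name true_class_limits is hypothetical.

```python
def true_class_limits(lower, upper, unit=1):
    """Extend the reported limits by half the rounding unit on each side."""
    half = unit / 2
    return (lower - half, upper + half)

reported = [(30, 39), (40, 49), (50, 59), (60, 69), (70, 79), (80, 89)]
true_limits = [true_class_limits(lo, hi) for lo, hi in reported]
print(true_limits)

# Adjacent intervals now share a boundary, so histogram cells leave no gaps.
no_gaps = all(a[1] == b[0] for a, b in zip(true_limits, true_limits[1:]))
print(no_gaps)
```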

If we construct a graph using these class limits as the base of our rectangles, no gaps

will result, and we will have the histogram shown in Figure 2.3.2. We used MINITAB to

construct this histogram, as shown in Figure 2.3.3.

We refer to the space enclosed by the boundaries of the histogram as the area of the

histogram. Each observation is allotted one unit of this area. Since we have 189

observations, the histogram consists of a total of 189 units. Each cell contains a certain

proportion of the total area, depending on the frequency. The second cell, for example,

contains 46/189 of the area. This, as we have learned, is the relative frequency of

occurrence of values between 39.5 and 49.5. From this we see that subareas of the

histogram defined by the cells correspond to the frequencies of occurrence of values

between the horizontal scale boundaries of the areas. The ratio of a particular subarea to the

total area of the histogram is equal to the relative frequency of occurrence of values

between the corresponding points on the horizontal axis.
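The correspondence between subareas and relative frequencies can be checked numerically. In the Python sketch below, areas are measured in raw axis units (frequency times class width), so the total is 1,890 rather than the 189 units described above. This is only a change of scale and leaves the ratios unaffected.

```python
# True class limits (Table 2.3.3) and frequencies (Table 2.3.1).
true_limits = [(29.5, 39.5), (39.5, 49.5), (49.5, 59.5),
               (59.5, 69.5), (69.5, 79.5), (79.5, 89.5)]
frequencies = [11, 46, 70, 45, 16, 1]

# Area of each cell = height (frequency) times width.
areas = [f * (upper - lower) for f, (lower, upper) in zip(frequencies, true_limits)]
area_shares = [a / sum(areas) for a in areas]
relative_frequencies = [f / sum(frequencies) for f in frequencies]

print(round(area_shares[1], 4))   # share of total area in the second cell
print(all(abs(a - r) < 1e-12 for a, r in zip(area_shares, relative_frequencies)))
```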

Dialog box:
Stat > Tables > Tally Individual Variables
Type C2 in Variables. Check Counts, Percents, Cumulative counts, and Cumulative percents in Display. Click OK.

Session command:
MTB > Tally C2;
SUBC> Counts;
SUBC> CumCounts;
SUBC> Percents;
SUBC> CumPercents;

Output:

MINITAB Output

Tally for Discrete Variables: C2

C2   Count   CumCnt   Percent   CumPct
0    11      11       5.82      5.82
1    46      57       24.34     30.16
2    70      127      37.04     67.20
3    45      172      23.81     91.01
4    16      188      8.47      99.47
5    1       189      0.53      100.00
N= 189

SPSS Output

                Frequency   Percent   Valid Percent   Cumulative Percent
Valid   30-39   11          5.8       5.8             5.8
        40-49   46          24.3      24.3            30.2
        50-59   70          37.0      37.0            67.2
        60-69   45          23.8      23.8            91.0
        70-79   16          8.5       8.5             99.5
        80-89   1           .5        .5              100.0
        Total   189         100.0     100.0

FIGURE 2.3.1 Frequency, cumulative frequencies, percent, and cumulative percent distribution of the ages of subjects described in Example 1.4.1 as constructed by MINITAB and SPSS.

The Frequency Polygon A frequency distribution can be portrayed graphically

in yet another way by means of a frequency polygon, which is a special kind of line graph.

To draw a frequency polygon we first place a dot above the midpoint of each class interval

represented on the horizontal axis of a graph like the one shown in Figure 2.3.2. The height

of a given dot above the horizontal axis corresponds to the frequency of the relevant class

interval. Connecting the dots by straight lines produces the frequency polygon. Figure 2.3.4

is the frequency polygon for the age data in Table 2.2.1.

Note that the polygon is brought down to the horizontal axis at the ends at points that

would be the midpoints if there were an additional cell at each end of the corresponding

histogram. This allows for the total area to be enclosed. The total area under the frequency

polygon is equal to the area under the histogram. Figure 2.3.5 shows the frequency polygon

of Figure 2.3.4 superimposed on the histogram of Figure 2.3.2. This figure allows you to

see, for the same set of data, the relationship between the two graphic forms.
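The construction of the polygon, and the claim that it encloses the same area as the histogram, can be sketched in Python. The vertices are the class midpoints and frequencies from our example, with one extra zero-height point added at each end. The area is computed with the trapezoidal rule, which is our choice of method for the check.

```python
midpoints = [34.5, 44.5, 54.5, 64.5, 74.5, 84.5]
frequencies = [11, 46, 70, 45, 16, 1]
width = 10

# Bring the polygon down to the axis at the midpoints of two extra, empty cells.
xs = [midpoints[0] - width] + midpoints + [midpoints[-1] + width]
ys = [0] + frequencies + [0]

# Trapezoidal rule for the area under the polygon.
polygon_area = sum((x1 - x0) * (y0 + y1) / 2
                   for x0, x1, y0, y1 in zip(xs, xs[1:], ys, ys[1:]))
histogram_area = width * sum(frequencies)

print(xs[0], xs[-1])                  # end points on the horizontal axis
print(polygon_area, histogram_area)   # the two areas agree
```

The equality holds here because the class intervals are of equal width.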

TABLE 2.3.3 The Data of Table 2.3.1 Showing True Class Limits

True Class Limits   Frequency
29.5–39.5           11
39.5–49.5           46
49.5–59.5           70
59.5–69.5           45
69.5–79.5           16
79.5–89.5           1
Total               189

[Figure: histogram. Horizontal axis: Age, labeled at the class midpoints 34.5, 44.5, 54.5, 64.5, 74.5, and 84.5. Vertical axis: Frequency, scaled from 0 to 70.]

FIGURE 2.3.2 Histogram of ages of 189 subjects from Table 2.3.1.

Dialog box:
Graph > Histogram > Simple > OK
Type Age in Graph Variables: Click OK.
Now double click the histogram and click Binning Tab.
Type 34.5:84.5/10 in MidPoint/CutPoint positions:
Click OK.

Session command:
MTB > Histogram 'Age';
SUBC> MidPoint 34.5:84.5/10;
SUBC> Bar.

FIGURE 2.3.3 MINITAB dialog box and session command for constructing histogram from data on ages in Example 1.4.1.

Stem-and-Leaf Displays Another graphical device that is useful for represent-

ing quantitative data sets is the stem-and-leaf display. A stem-and-leaf display bears a

strong resemblance to a histogram and serves the same purpose. A properly constructed

stem-and-leaf display, like a histogram, provides information regarding the range of the

data set, shows the location of the highest concentration of measurements, and reveals the

presence or absence of symmetry. An advantage of the stem-and-leaf display over the

histogram is the fact that it preserves the information contained in the individual

measurements. Such information is lost when measurements are assigned to the class

intervals of a histogram. As will become apparent, another advantage of stem-and-leaf

displays is the fact that they can be constructed during the tallying process, so the

intermediate step of preparing an ordered array is eliminated.

To construct a stem-and-leaf display we partition each measurement into two parts.

The first part is called the stem, and the second part is called the leaf. The stem consists of

one or more of the initial digits of the measurement, and the leaf is composed of one or

more of the remaining digits. All partitioned numbers are shown together in a single

display; the stems form an ordered column with the smallest stem at the top and the largest

at the bottom. We include in the stem column all stems within the range of the data even

when a measurement with that stem is not in the data set. The rows of the display contain

the leaves, ordered and listed to the right of their respective stems. When leaves consist of

more than one digit, all digits after the first may be deleted. Decimals when present in the

original data are omitted in the stem-and-leaf display. The stems are separated from their

leaves by a vertical line. Thus we see that a stem-and-leaf display is also an ordered array of

the data.
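The steps just described can be sketched in Python for two-digit measurements, using one-digit stems and one-digit leaves. The dictionary AGE_COUNTS is our own compact tally of the ages in Table 2.2.1, and the function name stem_and_leaf is hypothetical. This is an illustration of the idea, not a reproduction of any package's routine.

```python
from collections import defaultdict

# Tally of Table 2.2.1: age -> number of subjects with that age (189 in all).
AGE_COUNTS = {
    30: 1, 34: 1, 35: 1, 37: 2, 38: 4, 39: 2, 40: 2, 42: 2, 43: 6, 44: 7,
    45: 3, 46: 6, 47: 6, 48: 7, 49: 7, 50: 8, 51: 4, 52: 6, 53: 17, 54: 11,
    55: 3, 56: 6, 57: 7, 58: 2, 59: 6, 60: 4, 61: 11, 62: 7, 63: 2, 64: 6,
    65: 2, 66: 6, 67: 1, 68: 3, 69: 3, 70: 1, 71: 7, 72: 1, 73: 1, 75: 1,
    76: 1, 77: 1, 78: 3, 82: 1,
}
ages = [age for age, count in AGE_COUNTS.items() for _ in range(count)]

def stem_and_leaf(values, stem_unit=10):
    """Map each stem to a string of its ordered one-digit leaves."""
    leaves = defaultdict(list)
    for v in sorted(values):
        leaves[v // stem_unit].append(v % stem_unit)
    stems = range(min(leaves), max(leaves) + 1)   # keep stems that have no leaves
    return {s: "".join(str(d) for d in leaves.get(s, [])) for s in stems}

display = stem_and_leaf(ages)
for stem, leaf_string in display.items():
    print(f"{stem} | {leaf_string}")
```

Because the leaves are kept as individual digits, every original measurement can be recovered from the display.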

Stem-and-leaf displays are most effective with relatively small data sets. As a rule

they are not suitable for use in annual reports or other communications aimed at the general

public. They are primarily of value in helping researchers and decision makers understand

the nature of their data. Histograms are more appropriate for externally circulated

publications. The following example illustrates the construction of a stem-and-leaf display.

[Figure: frequency polygon. Horizontal axis: Age, labeled from 24.5 to 94.5 in steps of 10. Vertical axis: Frequency, scaled from 0 to 70.]

FIGURE 2.3.4 Frequency polygon for the ages of 189 subjects shown in Table 2.2.1.

[Figure: histogram with the frequency polygon superimposed. Horizontal axis: Age, labeled from 24.5 to 94.5 in steps of 10. Vertical axis: Frequency, scaled from 0 to 70.]

FIGURE 2.3.5 Histogram and frequency polygon for the ages of 189 subjects shown in Table 2.2.1.

EXAMPLE 2.3.2

Let us use the age data shown in Table 2.2.1 to construct a stem-and-leaf display.

Solution: Since the measurements are all two-digit numbers, we will have one-digit

stems and one-digit leaves. For example, the measurement 30 has a stem of 3

and a leaf of 0. Figure 2.3.6 shows the stem-and-leaf display for the data.

The MINITAB statistical software package may be used to construct

stem-and-leaf displays. The MINITAB procedure and output are as shown in

Figure 2.3.7. The increment subcommand specifies the distance from one

stem to the next. The numbers in the leftmost output column of Figure 2.3.7

Stem Leaf

3 04577888899

4 0022333333444444455566666677777788888889999999

5 0000000011112222223333333333333333344444444444555666666777777788999999

6 000011111111111222222233444444556666667888999

7 0111111123567888

8 2

FIGURE 2.3.6 Stem-and-leaf display of ages of 189 subjects shown in Table 2.2.1 (stem

unit = 10, leaf unit = 1).

Dialog box:
Graph > Stem-and-Leaf
Type Age in Graph Variables. Type 10 in Increment.
Click OK.

Session command:
MTB > Stem-and-Leaf 'Age';
SUBC> Increment 10.

Output:

Stem-and-Leaf Display: Age

Stem-and-leaf of Age N = 189
Leaf Unit = 1.0

  11  3  04577888899
  57  4  0022333333444444455566666677777788888889999999
(70)  5  00000000111122222233333333333333333444444444445556666667777777889+
  62  6  000011111111111222222233444444556666667888999
  17  7  0111111123567888
   1  8  2

FIGURE 2.3.7 Stem-and-leaf display prepared by MINITAB from the data on subjects’ ages shown in Table 2.2.1.


provide information regarding the number of observations (leaves) on a given

line and above or the number of observations on a given line and below. For

example, the number 57 on the second line shows that there are 57

observations (or leaves) on that line and the one above it. The number 62

on the fourth line from the top tells us that there are 62 observations on that

line and all the ones below. The number in parentheses tells us that there are

70 observations on that line. The parentheses mark the line containing the

middle observation if the total number of observations is odd or the two

middle observations if the total number of observations is even.

The + at the end of the third line in Figure 2.3.7 indicates that the

frequency for that line (age group 50 through 59) exceeds the line capacity,

and that there is at least one additional leaf that is not shown. In this case, the

frequency for the 50–59 age group was 70. The line contains only 65 leaves,

so the + indicates that there are five more leaves, each the number 9, that are not

shown.

One way to avoid exceeding the capacity of a line is to have more lines. This is

accomplished by making the distance between lines shorter, that is, by decreasing the

widths of the class intervals. For the present example, we may use class interval widths of 5,

so that the distance between lines is 5. Figure 2.3.8 shows the result when MINITAB is used

to produce the stem-and-leaf display.
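
The construction just described can also be sketched in a general-purpose language. The following Python fragment is a minimal illustration, not the MINITAB procedure: the function name stem_and_leaf, its increment argument, and the short list of ages are all assumptions made for this example. It handles two-digit values with a leaf unit of 1 and increments of 10 or 5 only, and it simply omits lines that would contain no leaves.

```python
from collections import defaultdict

def stem_and_leaf(values, increment=10):
    """Return the lines of a stem-and-leaf display for two-digit values.

    increment is the distance from one line to the next: 10 gives one
    line per stem, 5 splits each stem into leaves 0-4 and leaves 5-9.
    """
    if increment not in (5, 10):
        raise ValueError("this sketch supports increments of 5 or 10 only")
    rows = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)                   # 30 -> stem 3, leaf 0
        half = leaf // 5 if increment == 5 else 0    # which line within the stem
        rows[(stem, half)].append(str(leaf))
    return [f"{stem} {''.join(leaves)}"
            for (stem, _), leaves in sorted(rows.items())]

# Hypothetical ages, used only for illustration.
ages = [30, 34, 35, 41, 47, 47, 52, 53, 53, 58, 60, 66]

for line in stem_and_leaf(ages, increment=10):
    print(line)
print()
for line in stem_and_leaf(ages, increment=5):
    print(line)
```

With an increment of 10 each stem occupies one line; with an increment of 5 each stem is split into two shorter lines, which is the remedy suggested above for a line that exceeds its capacity.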

EXERCISES

2.3.1 In a study of the oral home care practice and reasons for seeking dental care among individuals on

renal dialysis, Atassi (A-1) studied 90 subjects on renal dialysis. The oral hygiene status of all

subjects was examined using a plaque index with a range of 0 to 3 (0 = no soft plaque deposits,

Stem-and-leaf of Age N = 189

Leaf Unit = 1.0

2 3 04

11 3 577888899

28 4 00223333334444444

57 4 55566666677777788888889999999

(46) 5 0000000011112222223333333333333333344444444444

86 5 555666666777777788999999

62 6 000011111111111222222233444444

32 6 556666667888999

17 7 0111111123

7 7 567888

1 8 2

FIGURE 2.3.8 Stem-and-leaf display prepared by MINITAB from the data on subjects’ ages

shown in Table 2.2.1; class interval width = 5.


3 = an abundance of soft plaque deposits). The following table shows the plaque index scores for all

90 subjects.

1.17 2.50 2.00 2.33 1.67 1.33

1.17 2.17 2.17 1.33 2.17 2.00

2.17 1.17 2.50 2.00 1.50 1.50

1.00 2.17 2.17 1.67 2.00 2.00

1.33 2.17 2.83 1.50 2.50 2.33

0.33 2.17 1.83 2.00 2.17 2.00

1.00 2.17 2.17 1.33 2.17 2.50

0.83 1.17 2.17 2.50 2.00 2.50

0.50 1.50 2.00 2.00 2.00 2.00

1.17 1.33 1.67 2.17 1.50 2.00

1.67 0.33 1.50 2.17 2.33 2.33

1.17 0.00 1.50 2.33 1.83 2.67

0.83 1.17 1.50 2.17 2.67 1.50

2.00 2.17 1.33 2.00 2.33 2.00

2.17 2.17 2.00 2.17 2.00 2.17

Source: Data provided courtesy of Farhad

Atassi, DDS, MSc, FICOI.

(a) Use these data to prepare:

A frequency distribution

A relative frequency distribution

A cumulative frequency distribution

A cumulative relative frequency distribution

A histogram

A frequency polygon

(b) What percentage of the measurements are less than 2.00?

(c) What proportion of the subjects have measurements greater than or equal to 1.50?

(d) What percentage of the measurements are between 1.50 and 1.99 inclusive?

(e) How many of the measurements are greater than 2.49?

(f) What proportion of the measurements are either less than 1.0 or greater than 2.49?

(g) Someone picks a measurement at random from this data set and asks you to guess the value.

What would be your answer? Why?

(h) Frequency distributions and their histograms may be described in a number of ways depending

on their shape. For example, they may be symmetric (the left half is at least approximately a mirror

image of the right half), skewed to the left (the frequencies tend to increase as the measurements

increase in size), skewed to the right (the frequencies tend to decrease as the measurements increase

in size), or U-shaped (the frequencies are high at each end of the distribution and small in the center).

How would you describe the present distribution?

2.3.2 Janardhan et al. (A-2) conducted a study in which they measured incidental intracranial aneurysms

(IIAs) in 125 patients. The researchers examined postprocedural complications and concluded that

IIAs can be safely treated without causing mortality and with a lower complications rate than

previously reported. The following are the sizes (in millimeters) of the 159 IIAs in the sample.

8.1 10.0 5.0 7.0 10.0 3.0

20.0 4.0 4.0 6.0 6.0 7.0


10.0 4.0 3.0 5.0 6.0 6.0

6.0 6.0 6.0 5.0 4.0 5.0

6.0 25.0 10.0 14.0 6.0 6.0

4.0 15.0 5.0 5.0 8.0 19.0

21.0 8.3 7.0 8.0 5.0 8.0

5.0 7.5 7.0 10.0 15.0 8.0

10.0 3.0 15.0 6.0 10.0 8.0

7.0 5.0 10.0 3.0 7.0 3.3

15.0 5.0 5.0 3.0 7.0 8.0

3.0 6.0 6.0 10.0 15.0 6.0

3.0 3.0 7.0 5.0 4.0 9.2

16.0 7.0 8.0 5.0 10.0 10.0

9.0 5.0 5.0 4.0 8.0 4.0

3.0 4.0 5.0 8.0 30.0 14.0

15.0 2.0 8.0 7.0 12.0 4.0

3.8 10.0 25.0 8.0 9.0 14.0

30.0 2.0 10.0 5.0 5.0 10.0

22.0 5.0 5.0 3.0 4.0 8.0

7.5 5.0 8.0 3.0 5.0 7.0

8.0 5.0 9.0 11.0 2.0 10.0

6.0 5.0 5.0 12.0 9.0 8.0

15.0 18.0 10.0 9.0 5.0 6.0

6.0 8.0 12.0 10.0 5.0

5.0 16.0 8.0 5.0 8.0

4.0 16.0 3.0 7.0 13.0

Source: Data provided courtesy of

Vallabh Janardhan, M.D.

(a) Use these data to prepare:

A frequency distribution

A relative frequency distribution

A cumulative frequency distribution

A cumulative relative frequency distribution

A histogram

A frequency polygon

(b) What percentage of the measurements are between 10 and 14.9 inclusive?

(c) How many observations are less than 20?

(d) What proportion of the measurements are greater than or equal to 25?

(e) What percentage of the measurements are either less than 10.0 or greater than 19.95?

(f) Refer to Exercise 2.3.1, part h. Describe the distribution of the size of the aneurysms in this sample.

2.3.3 Hoekema et al. (A-3) studied the craniofacial morphology of patients diagnosed with obstructive

sleep apnea syndrome (OSAS) in healthy male subjects. One of the demographic variables the

researchers collected for all subjects was the Body Mass Index (calculated by dividing weight in kg

by the square of the patient’s height in m). The following are the BMI values of 29 OSAS subjects.

33.57 27.78 40.81

38.34 29.01 47.78

26.86 54.33 28.99


25.21 30.49 27.38

36.42 41.50 29.39

24.54 41.75 44.68

24.49 33.23 47.09

29.07 28.21 42.10

26.54 27.74 33.48

31.44 30.08

Source: Data provided courtesy

of A. Hoekema, D.D.S.

(a) Use these data to construct:

A frequency distribution

A relative frequency distribution

A cumulative frequency distribution

A cumulative relative frequency distribution

A histogram

A frequency polygon

(b) What percentage of the measurements are less than 30?

(c) What percentage of the measurements are between 40.0 and 49.99 inclusive?

(d) What percentage of the measurements are greater than 34.99?

(e) Describe these data with respect to symmetry and skewness as discussed in Exercise 2.3.1, part h.

(f) How many of the measurements are less than 40?

2.3.4 David Holben (A-4) studied selenium levels in beef raised in a low selenium region of the United

States. The goal of the study was to compare selenium levels in the region-raised beef to selenium

levels in cooked venison, squirrel, and beef from other regions of the United States. The data below

are the selenium levels calculated on a dry weight basis in μg/100 g for a sample of 53 region-raised

cattle.

11.23 15.82

29.63 27.74

20.42 22.35

10.12 34.78

39.91 35.09

32.66 32.60

38.38 37.03

36.21 27.00

16.39 44.20

27.44 13.09

17.29 33.03

56.20 9.69

28.94 32.45

20.11 37.38

25.35 34.91

21.77 27.99

31.62 22.36

32.63 22.68

30.31 26.52

46.16 46.01


56.61 38.04

24.47 30.88

29.39 30.04

40.71 25.91

18.52 18.54

27.80 25.51

19.49

Source: Data provided courtesy

of David Holben, Ph.D.

(a) Use these data to construct:

A frequency distribution

A relative frequency distribution

A cumulative frequency distribution

A cumulative relative frequency distribution

A histogram

A frequency polygon

(b) Describe these data with respect to symmetry and skewness as discussed in Exercise 2.3.1, part h.

(c) How many of the measurements are greater than 40?

(d) What percentage of the measurements are less than 25?

2.3.5 The following table shows the number of hours 45 hospital patients slept following the administration

of a certain anesthetic.

7 10 12 4 8 7 3 8 5

12 11 3 8 1 1 13 10 4

4 5 5 8 7 7 3 2 3

8 13 1 7 17 3 4 5 5

3 1 17 10 4 7 7 11 8

(a) From these data construct:

A frequency distribution

A relative frequency distribution

A histogram

A frequency polygon

(b) Describe these data relative to symmetry and skewness as discussed in Exercise 2.3.1, part h.

2.3.6 The following are the number of babies born during a year in 60 community hospitals.

30 55 27 45 56 48 45 49 32 57 47 56

37 55 52 34 54 42 32 59 35 46 24 57

32 26 40 28 53 54 29 42 42 54 53 59

39 56 59 58 49 53 30 53 21 34 28 50

52 57 43 46 54 31 22 31 24 24 57 29

(a) From these data construct:

A frequency distribution

A relative frequency distribution

A histogram

A frequency polygon

(b) Describe these data relative to symmetry and skewness as discussed in Exercise 2.3.1, part h.


2.3.7 In a study of physical endurance levels of male college freshmen, the following composite endurance

scores based on several exercise routines were collected.

254 281 192 260 212 179 225 179 181 149

182 210 235 239 258 166 159 223 186 190

180 188 135 233 220 204 219 211 245 151

198 190 151 157 204 238 205 229 191 200

222 187 134 193 264 312 214 227 190 212

165 194 206 193 218 198 241 149 164 225

265 222 264 249 175 205 252 210 178 159

220 201 203 172 234 198 173 187 189 237

272 195 227 230 168 232 217 249 196 223

232 191 175 236 152 258 155 215 197 210

214 278 252 283 205 184 172 228 193 130

218 213 172 159 203 212 117 197 206 198

169 187 204 180 261 236 217 205 212 218

191 124 199 235 139 231 116 182 243 217

251 206 173 236 215 228 183 204 186 134

188 195 240 163 208

(a) From these data construct:

A frequency distribution

A relative frequency distribution

A frequency polygon

A histogram

(b) Describe these data relative to symmetry and skewness as discussed in Exercise 2.3.1, part h.

2.3.8 The following are the ages of 30 patients seen in the emergency room of a hospital on a Friday night.

Construct a stem-and-leaf display from these data. Describe these data relative to symmetry and

skewness as discussed in Exercise 2.3.1, part h.

35 32 21 43 39 60

36 12 54 45 37 53

45 23 64 10 34 22

36 45 55 44 55 46

22 38 35 56 45 57

2.3.9 The following are the emergency room charges made to a sample of 25 patients at two city hospitals.

Construct a stem-and-leaf display for each set of data. What does a comparison of the two displays

suggest regarding the two hospitals? Describe the two sets of data with respect to symmetry and

skewness as discussed in Exercise 2.3.1, part h.

Hospital A

249.10 202.50 222.20 214.40 205.90

214.30 195.10 213.30 225.50 191.40

201.20 239.80 245.70 213.00 238.80

171.10 222.00 212.50 201.70 184.90

248.30 209.70 233.90 229.80 217.90


Hospital B

199.50 184.00 173.20 186.00 214.10

125.50 143.50 190.40 152.00 165.70

154.70 145.30 154.60 190.30 135.40

167.70 203.40 186.70 155.30 195.90

168.90 166.70 178.60 150.20 212.40

2.3.10 Refer to the ages of patients discussed in Example 1.4.1 and displayed in Table 1.4.1.

(a) Use class interval widths of 5 and construct:

A frequency distribution

A relative frequency distribution

A cumulative frequency distribution

A cumulative relative frequency distribution

A histogram

A frequency polygon

(b) Describe these data with respect to symmetry and skewness as discussed in Exercise 2.3.1, part h.

2.3.11 The objectives of a study by Skjelbo et al. (A-5) were to examine (a) the relationship between

chloroguanide metabolism and efficacy in malaria prophylaxis and (b) the mephenytoin metabolism

and its relationship to chloroguanide metabolism among Tanzanians. From information provided

by urine specimens from the 216 subjects, the investigators computed the ratio of unchanged

S-mephenytoin to R-mephenytoin (S/R ratio). The results were as follows:

0.0269 0.0400 0.0550 0.0550 0.0650 0.0670 0.0700 0.0720

0.0760 0.0850 0.0870 0.0870 0.0880 0.0900 0.0900 0.0990

0.0990 0.0990 0.0990 0.0990 0.0990 0.0990 0.0990 0.0990

0.0990 0.0990 0.0990 0.0990 0.0990 0.0990 0.0990 0.0990

0.0990 0.0990 0.0990 0.0990 0.0990 0.0990 0.0990 0.0990

0.0990 0.0990 0.0990 0.0990 0.0990 0.1000 0.1020 0.1040

0.1050 0.1050 0.1080 0.1080 0.1090 0.1090 0.1090 0.1160

0.1190 0.1200 0.1230 0.1240 0.1340 0.1340 0.1370 0.1390

0.1460 0.1480 0.1490 0.1490 0.1500 0.1500 0.1500 0.1540

0.1550 0.1570 0.1600 0.1650 0.1650 0.1670 0.1670 0.1677

0.1690 0.1710 0.1720 0.1740 0.1780 0.1780 0.1790 0.1790

0.1810 0.1880 0.1890 0.1890 0.1920 0.1950 0.1970 0.2010

0.2070 0.2100 0.2100 0.2140 0.2150 0.2160 0.2260 0.2290

0.2390 0.2400 0.2420 0.2430 0.2450 0.2450 0.2460 0.2460

0.2470 0.2540 0.2570 0.2600 0.2620 0.2650 0.2650 0.2680

0.2710 0.2800 0.2800 0.2870 0.2880 0.2940 0.2970 0.2980

0.2990 0.3000 0.3070 0.3100 0.3110 0.3140 0.3190 0.3210

0.3400 0.3440 0.3480 0.3490 0.3520 0.3530 0.3570 0.3630

0.3630 0.3660 0.3830 0.3900 0.3960 0.3990 0.4080 0.4080

0.4090 0.4090 0.4100 0.4160 0.4210 0.4260 0.4290 0.4290

0.4300 0.4360 0.4370 0.4390 0.4410 0.4410 0.4430 0.4540

0.4680 0.4810 0.4870 0.4910 0.4980 0.5030 0.5060 0.5220

0.5340 0.5340 0.5460 0.5480 0.5480 0.5490 0.5550 0.5920

0.5930 0.6010 0.6240 0.6280 0.6380 0.6600 0.6720 0.6820


0.6870 0.6900 0.6910 0.6940 0.7040 0.7120 0.7200 0.7280

0.7860 0.7950 0.8040 0.8200 0.8350 0.8770 0.9090 0.9520

0.9530 0.9830 0.9890 1.0120 1.0260 1.0320 1.0620 1.1600

Source: Data provided courtesy of Erik Skjelbo, M.D.

(a) From these data construct the following distributions: frequency, relative frequency, cumulative

frequency, and cumulative relative frequency; and the following graphs: histogram, frequency

polygon, and stem-and-leaf plot.

(b) Describe these data with respect to symmetry and skewness as discussed in Exercise 2.3.1, part h.

(c) The investigators defined as poor metabolizers of mephenytoin any subject with an S/R

mephenytoin ratio greater than .9. How many and what percentage of the subjects were poor

metabolizers?

(d) How many and what percentage of the subjects had ratios less than .7? Between .3 and .6999

inclusive? Greater than .4999?

2.3.12 Schmidt et al. (A-6) conducted a study to investigate whether autotransfusion of shed mediastinal

blood could reduce the number of patients needing homologous blood transfusion and reduce the

amount of transfused homologous blood if fixed transfusion criteria were used. The following table

shows the heights in meters of the 109 subjects, of whom 97 were males.

1.720 1.710 1.700 1.655 1.800 1.700

1.730 1.700 1.820 1.810 1.720 1.800

1.800 1.800 1.790 1.820 1.800 1.650

1.680 1.730 1.820 1.720 1.710 1.850

1.760 1.780 1.760 1.820 1.840 1.690

1.770 1.920 1.690 1.690 1.780 1.720

1.750 1.710 1.690 1.520 1.805 1.780

1.820 1.790 1.760 1.830 1.760 1.800

1.700 1.760 1.750 1.630 1.760 1.770

1.840 1.690 1.640 1.760 1.850 1.820

1.760 1.700 1.720 1.780 1.630 1.650

1.660 1.880 1.740 1.900 1.830

1.600 1.800 1.670 1.780 1.800

1.750 1.610 1.840 1.740 1.750

1.960 1.760 1.730 1.730 1.810

1.810 1.775 1.710 1.730 1.740

1.790 1.880 1.730 1.560 1.820

1.780 1.630 1.640 1.600 1.800

1.800 1.780 1.840 1.830

1.770 1.690 1.800 1.620

Source: Data provided courtesy of Erik Skjelbo, M.D.

(a) For these data construct the following distributions: frequency, relative frequency, cumulative

frequency, and cumulative relative frequency; and the following graphs: histogram, frequency

polygon, and stem-and-leaf plot.

(b) Describe these data with respect to symmetry and skewness as discussed in Exercise 2.3.1, part h.

(c) How do you account for the shape of the distribution of these data?

(d) How tall were the tallest 6.42 percent of the subjects?

(e) How tall were the shortest 10.09 percent of the subjects?


2.4 DESCRIPTIVE STATISTICS:

MEASURES OF CENTRAL TENDENCY

Although frequency distributions serve useful purposes, there are many situations that

require other types of data summarization. What we need in many instances is the ability to

summarize the data by means of a single number called a descriptive measure. Descriptive

measures may be computed from the data of a sample or the data of a population. To

distinguish between them we have the following definitions:

DEFINITIONS

1. A descriptive measure computed from the data of a sample is called a

statistic.

2. A descriptive measure computed from the data of a population is

called a parameter.

Several types of descriptive measures can be computed from a set of data. In this

chapter, however, we limit discussion to measures of central tendency and measures of

dispersion. We consider measures of central tendency in this section and measures of

dispersion in the following one.

In each of the measures of central tendency, of which we discuss three, we have a

single value that is considered to be typical of the set of data as a whole. Measures of central

tendency convey information regarding the average value of a set of values. As we will see,

the word average can be defined in different ways.

The three most commonly used measures of central tendency are the mean, the

median, and the mode.

Arithmetic Mean The most familiar measure of central tendency is the arithmetic

mean. It is the descriptive measure most people have in mind when they speak of the

“average.” The adjective arithmetic distinguishes this mean from other means that can be

computed. Since we are not covering these other means in this book, we shall refer to the

arithmetic mean simply as the mean. The mean is obtained by adding all the values in a

population or sample and dividing by the number of values that are added.

EXAMPLE 2.4.1

We wish to obtain the mean age of the population of 189 subjects represented in Table 1.4.1.

Solution: We proceed as follows:

\[
\text{mean age} = \frac{48 + 35 + 46 + \cdots + 73 + 66}{189} = 55.032
\]

The three dots in the numerator represent the values we did not show in order to save

space.


General Formula for the Mean It will be convenient if we can generalize the

procedure for obtaining the mean and, also, represent the procedure in a more compact

notational form. Let us begin by designating the random variable of interest by the capital

letter X. In our present illustration we let X represent the random variable, age. Specific

values of a random variable will be designated by the lowercase letter x. To distinguish one

value from another, we attach a subscript to the x and let the subscript refer to the first, the

second, the third value, and so on. For example, from Table 1.4.1 we have

\[
x_1 = 48,\quad x_2 = 35,\quad \ldots,\quad x_{189} = 66
\]

In general, a typical value of a random variable will be designated by $x_i$ and the final value, in a finite population of values, by $x_N$, where $N$ is the number of values in the population. Finally, we will use the Greek letter $\mu$ to stand for the population mean. We may now write the general formula for a finite population mean as follows:

\[
\mu = \frac{\sum_{i=1}^{N} x_i}{N} \tag{2.4.1}
\]

The symbol $\sum_{i=1}^{N}$ instructs us to add all values of the variable from the first to the last. This symbol $\Sigma$, called the summation sign, will be used extensively in this book. When from the context it is obvious which values are to be added, the symbols above and below $\Sigma$ will be omitted.

The Sample Mean When we compute the mean for a sample of values, the

procedure just outlined is followed with some modifications in notation. We use $\bar{x}$ to designate the sample mean and $n$ to indicate the number of values in the sample. The sample mean then is expressed as

\[
\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \tag{2.4.2}
\]

EXAMPLE 2.4.2

In Chapter 1 we selected a simple random sample of 10 subjects from the population of

subjects represented in Table 1.4.1. Let us now compute the mean age of the 10 subjects in

our sample.

Solution: We recall (see Table 1.4.2) that the ages of the 10 subjects in our sample were $x_1 = 43$, $x_2 = 66$, $x_3 = 61$, $x_4 = 64$, $x_5 = 65$, $x_6 = 38$, $x_7 = 59$, $x_8 = 57$, $x_9 = 57$, $x_{10} = 50$. Substitution of our sample data into Equation 2.4.2 gives

\[
\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{43 + 66 + \cdots + 50}{10} = 56
\]
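
As a check on Equation 2.4.2, the same computation can be sketched in a few lines of Python. The function name sample_mean is an assumption made for this illustration; statistics.mean is part of the Python standard library and is shown only for comparison.

```python
import statistics

# Ages of the 10 sampled subjects (Example 2.4.2).
ages = [43, 66, 61, 64, 65, 38, 59, 57, 57, 50]

def sample_mean(values):
    """Equation 2.4.2: the sum of the values divided by their number."""
    return sum(values) / len(values)

print(sum(ages))              # numerator of Equation 2.4.2
print(sample_mean(ages))      # the sample mean
print(statistics.mean(ages))  # standard-library equivalent
```

Both approaches add the ten ages and divide by 10, so they agree with the value of 56 obtained above.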


Properties of the Mean The arithmetic mean possesses certain properties, some

desirable and some not so desirable. These properties include the following:

1. Uniqueness. For a given set of data there is one and only one arithmetic mean.

2. Simplicity. The arithmetic mean is easily understood and easy to compute.

3. Since each and every value in a set of data enters into the computation of the mean, it

is affected by each value. Extreme values, therefore, have an influence on the mean

and, in some cases, can so distort it that it becomes undesirable as a measure of

central tendency.

As an example of how extreme values may affect the mean, consider the following

situation. Suppose the five physicians who practice in an area are surveyed to determine

their charges for a certain procedure. Assume that they report these charges: $75, $75, $80,

$80, and $280. The mean charge for the five physicians is found to be $118, a value that is

not very representative of the set of data as a whole. The single atypical value had the effect

of inflating the mean.
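
The effect of the single atypical charge can be seen by computing the mean with and without it. The short Python sketch below uses the five charges given above; the function and variable names are assumptions chosen for this illustration.

```python
# Charges reported by the five physicians, in dollars.
charges = [75, 75, 80, 80, 280]

def mean(values):
    """Arithmetic mean: sum of the values divided by their number."""
    return sum(values) / len(values)

with_extreme = mean(charges)          # all five charges
without_extreme = mean(charges[:-1])  # the $280 charge left out

print(with_extreme)
print(without_extreme)
```

Leaving out the $280 charge lowers the mean from $118 to $77.50, a figure much closer to what four of the five physicians actually charge.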

Median The median of a finite set of values is that value which divides the set into

two equal parts such that the number of values equal to or greater than the median is

equal to the number of values equal to or less than the median. If the number of values is

odd, the median will be the middle value when all values have been arranged in order of

magnitude. When the number of values is even, there is no single middle value. Instead

there are two middle values. In this case the median is taken to be the mean of these two

middle values, when all values have been arranged in the order of their magnitudes. In other words, the median observation of a data set is the $(n + 1)/2$th one when the observations have been ordered. If, for example, we have 11 observations, the median is the $(11 + 1)/2 = 6$th ordered observation. If we have 12 observations, the median is the $(12 + 1)/2 = 6.5$th ordered observation and is a value halfway between the 6th and 7th ordered observations.

EXAMPLE 2.4.3

Let us illustrate by finding the median of the data in Table 2.2.1.

Solution: The values are already ordered, so we need only find the middle value. The middle value is the $(n + 1)/2 = (189 + 1)/2 = 190/2 = 95$th one. Counting from the smallest up to the 95th value we see that it is 54. Thus the median age of the 189 subjects is 54 years.

EXAMPLE 2.4.4

We wish to find the median age of the subjects represented in the sample described in

Example 2.4.2.

Solution: Arraying the 10 ages in order of magnitude from smallest to largest gives 38,

43, 50, 57, 57, 59, 61, 64, 65, 66. Since we have an even number of ages, there


is no middle value. The two middle values, however, are 57 and 59. The

median, then, is $(57 + 59)/2 = 58$.
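
The rule that the median is the $(n + 1)/2$th ordered observation translates directly into code. The Python sketch below is one way to express it; the function name median and its internal variable names are assumptions made for this illustration.

```python
def median(values):
    """Median by the (n + 1)/2 rule applied to the ordered values."""
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) / 2                # 1-based position of the median
    lower = ordered[int(position) - 1]    # observation at or just below it
    if position.is_integer():             # odd n: a single middle value
        return lower
    upper = ordered[int(position)]        # even n: the next observation
    return (lower + upper) / 2            # halfway between the two

# Ages of the 10 sampled subjects (Example 2.4.4).
ages = [43, 66, 61, 64, 65, 38, 59, 57, 57, 50]
print(median(ages))
```

For the 10 ages the position is 5.5, so the function averages the 5th and 6th ordered values, 57 and 59, and returns 58.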

Properties of the Median Properties of the median include the following:

1. Uniqueness. As is true with the mean, there is only one median for a given set of

data.

2. Simplicity. The median is easy to calculate.

3. It is not as drastically affected by extreme values as is the mean.

The Mode The mode of a set of values is that value which occurs most frequently. If

all the values are different there is no mode; on the other hand, a set of values may have

more than one mode.

EXAMPLE 2.4.5

Find the modal age of the subjects whose ages are given in Table 2.2.1.

Solution: A count of the ages in Table 2.2.1 reveals that the age 53 occurs most

frequently (17 times). The mode for this population of ages is 53.

For an example of a set of values that has more than one mode, let us consider

a laboratory with 10 employees whose ages are 20, 21, 20, 20, 34, 22, 24, 27, 27,

and 27. We could say that these data have two modes, 20 and 27. The sample

consisting of the values 10, 21, 33, 53, and 54 has no mode since all the values are

different.
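
The two situations just described, more than one mode and no mode, can be handled by counting how often each value occurs. The Python sketch below follows the convention used in this section, under which a set of all-different values has no mode; the function name modes is an assumption made for this illustration.

```python
from collections import Counter

def modes(values):
    """Return the most frequent value(s), or an empty list if none repeats."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:                      # all values different: no mode
        return []
    return sorted(v for v, c in counts.items() if c == highest)

employee_ages = [20, 21, 20, 20, 34, 22, 24, 27, 27, 27]
print(modes(employee_ages))               # the laboratory example
print(modes([10, 21, 33, 53, 54]))        # all values different
```

The ages 20 and 27 each occur three times, so both are reported; the second sample yields an empty list because no value occurs more than once.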

The mode may be used also for describing qualitative data. For example, suppose the

patients seen in a mental health clinic during a given year received one of the following

diagnoses: mental retardation, organic brain syndrome, psychosis, neurosis, and personality

disorder. The diagnosis occurring most frequently in the group of patients would be

called the modal diagnosis.

An attractive property of a data distribution occurs when the mean, median, and

mode are all equal. The well-known “bell-shaped curve” is a graphical representation of

a distribution for which the mean, median, and mode are all equal. Much statistical

inference is based on distributions of this kind, the most common of which is the normal

distribution. The normal distribution is introduced in Section 4.6 and discussed further

in subsequent chapters. Another common distribution of this type is the t-distribution,

which is introduced in Section 6.3.

Skewness Data distributions may be classified on the basis of whether they are

symmetric or asymmetric. If a distribution is symmetric, the left half of its graph

(histogram or frequency polygon) will be a mirror image of its right half. When the

left half and right half of the graph of a distribution are not mirror images of each other, the

distribution is asymmetric.


DEFINITION

If the graph (histogram or frequency polygon) of a distribution is

asymmetric, the distribution is said to be skewed. If a distribution is

not symmetric because its graph extends further to the right than to

the left, that is, if it has a long tail to the right, we say that the distribution

is skewed to the right or is positively skewed. If a distribution is not

symmetric because its graph extends further to the left than to the right,

that is, if it has a long tail to the left, we say that the distribution is

skewed to the left or is negatively skewed.

A distribution will be skewed to the right, or positively skewed, if its mean is greater

than its mode. A distribution will be skewed to the left, or negatively skewed, if its mean is

less than its mode. Skewness can be expressed as follows:

\[
\text{Skewness} = \frac{\sqrt{n}\,\sum_{i=1}^{n} (x_i - \bar{x})^3}{\left(\sum_{i=1}^{n} (x_i - \bar{x})^2\right)^{3/2}} = \frac{\sqrt{n}\,\sum_{i=1}^{n} (x_i - \bar{x})^3}{(n - 1)\sqrt{n - 1}\; s^3} \tag{2.4.3}
\]

In Equation 2.4.3, s is the standard deviation of a sample as defined in Equation 2.5.4. Most

computer statistical packages include this statistic as part of a standard printout. A value of

skewness > 0 indicates positive skewness and a value of skewness < 0 indicates negative

skewness. An illustration of skewness is shown in Figure 2.4.1.

EXAMPLE 2.4.6

Consider the three distributions shown in Figure 2.4.1. Given that the histograms represent

frequency counts, the data can be easily re-created and entered into a statistical package.

For example, observation of the “No Skew” distribution would yield the following data:

5, 5, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 8, 8, 9, 9, 9, 9, 10, 10, 10, 11, 11. Values can be obtained from the skewed distributions in a similar fashion. Using SPSS software, the following descriptive statistics were obtained for these three distributions:

            No Skew    Right Skew    Left Skew
Mean        8.0000     6.6667        8.3333
Median      8.0000     6.0000        9.0000
Mode        8.00       5.00          10.00
Skewness    .000       .627          −.627

FIGURE 2.4.1 Three histograms illustrating skewness.
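
The first form of Equation 2.4.3 can be evaluated directly. The Python sketch below applies it to the “No Skew” data listed in Example 2.4.6; the function name skewness and the second, right-tailed data set are assumptions made purely for illustration. Statistical packages may apply a different adjustment for sample size, so values computed this way need not match a package's printout exactly.

```python
from math import sqrt

def skewness(values):
    """First form of Equation 2.4.3."""
    n = len(values)
    mean = sum(values) / n
    sum_sq = sum((x - mean) ** 2 for x in values)    # squared deviations
    sum_cube = sum((x - mean) ** 3 for x in values)  # cubed deviations
    return sqrt(n) * sum_cube / sum_sq ** 1.5

# "No Skew" data re-created in Example 2.4.6.
no_skew = [5, 5, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 8, 8,
           9, 9, 9, 9, 10, 10, 10, 11, 11]

# Hypothetical data with a long right tail, and its mirror image.
right_tail = [1, 2, 2, 3, 3, 3, 9]
left_tail = [-x for x in right_tail]

print(skewness(no_skew))
print(skewness(right_tail) > 0, skewness(left_tail) < 0)
```

For the symmetric data the cubed deviations cancel and the statistic is zero. A long tail to the right gives a positive value, and reflecting the same data gives a negative one.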

2.5 DESCRIPTIVE STATISTICS:

MEASURES OF DISPERSION

The dispersion of a set of observations refers to the variety that they exhibit. A measure of

dispersion conveys information regarding the amount of variability present in a set of data.

If all the values are the same, there is no dispersion; if they are not all the same, dispersion is

present in the data. The amount of dispersion may be small when the values, though

different, are close together. Figure 2.5.1 shows the frequency polygons for two populations

that have equal means but different amounts of variability. Population B, which is

more variable than population A, is more spread out. If the values are widely scattered, the

dispersion is greater. Other terms used synonymously with dispersion include variation,

spread, and scatter.

The Range One way to measure the variation in a set of values is to compute the

range. The range is the difference between the largest and smallest value in a set of

observations. If we denote the range by $R$, the largest value by $x_L$, and the smallest value by $x_S$, we compute the range as follows:

\[
R = x_L - x_S \tag{2.5.1}
\]

FIGURE 2.5.1 Two frequency distributions with equal means but different amounts of dispersion. (Curves labeled Population A and Population B, both centered at $\mu$.)


EXAMPLE 2.5.1

We wish to compute the range of the ages of the sample subjects discussed in Table 2.2.1.

Solution: Since the youngest subject in the sample is 30 years old and the oldest is 82,

we compute the range to be

\[
R = 82 - 30 = 52
\]

The usefulness of the range is limited. The fact that it takes into account only two values

causes it to be a poor measure of dispersion. The main advantage in using the range is the

simplicity of its computation. Since the range, expressed as a single measure, imparts

minimal information about a data set and therefore is of limited use, it is often preferable to express the range as a number pair, $[x_S, x_L]$, in which $x_S$ and $x_L$ are the smallest and largest

values in the data set, respectively. For the data in Example 2.5.1, we may express the range

as the number pair [30, 82]. Although this is not the traditional expression for the range, it is

intuitive to imagine that knowledge of the minimum and maximum values in this data set

would convey more information than knowing only that the range is equal to 52. An infinite

number of distributions, each with quite different minimum and maximum values, may

have a range of 52.
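
Both ways of reporting the range, the single number of Equation 2.5.1 and the number pair, are simple to obtain together. The Python sketch below uses the 10 sample ages from Example 2.4.2 and the two extreme ages from Example 2.5.1; the function name value_range is an assumption made for this illustration.

```python
def value_range(values):
    """Return R = x_L - x_S together with the pair (x_S, x_L)."""
    smallest, largest = min(values), max(values)
    return largest - smallest, (smallest, largest)

# Ages of the 10 sampled subjects (Example 2.4.2).
ages = [43, 66, 61, 64, 65, 38, 59, 57, 57, 50]
print(value_range(ages))

# Youngest and oldest subjects from Example 2.5.1.
print(value_range([30, 82]))
```

Reporting the pair alongside $R$ preserves where the data lie as well as how far they spread: the pairs (38, 66) and (30, 82) describe the data more fully than the ranges 28 and 52 do alone.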

The Variance When the values of a set of observations lie close to their mean, the

dispersion is less than when they are scattered over a wide range. Since this is true, it would

be intuitively appealing if we could measure dispersion relative to the scatter of the values

about their mean. Such a measure is realized in what is known as the variance. In

computing the variance of a sample of values, for example, we subtract the mean from each of the values, square the resulting differences, and then add up the squared differences. This sum of the squared deviations of the values from their mean is divided by the sample size, minus 1, to obtain the sample variance. Letting $s^2$ stand for the sample variance, the procedure may be written in notational form as follows:

\[
s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \tag{2.5.2}
\]

It is therefore easy to see that the variance can be described as the average squared

deviation of individual values from the mean of that set. It may seem nonintuitive at this

stage that the differences in the numerator be squared. However, consider a symmetric

distribution. It is easy to imagine that if we compute the difference of each data point in the

distribution from the mean value, half of the differences would be positive and half would

be negative, resulting in a sum that would be zero. A variance of zero would be a

noninformative measure for any distribution of numbers except one in which all of the

values are the same. Therefore, the square of each difference is used to ensure a positive

numerator and hence a much more valuable measure of dispersion.

EXAMPLE 2.5.2

Let us illustrate by computing the variance of the ages of the subjects discussed in

Example 2.4.2.


Solution:

$$s^2 = \frac{(43 - 56)^2 + (66 - 56)^2 + \cdots + (50 - 56)^2}{9} = \frac{810}{9} = 90$$

Degrees of Freedom The reason for dividing by $n - 1$ rather than n, as we might have expected, is the theoretical consideration referred to as degrees of freedom. In computing the variance, we say that we have $n - 1$ degrees of freedom. We reason as follows. The sum of the deviations of the values from their mean is equal to zero, as can be shown. If, then, we know the values of $n - 1$ of the deviations from the mean, we know the nth one, since it is automatically determined because of the necessity for all n values to add to zero. From a practical point of view, dividing the squared differences by $n - 1$ rather than n is necessary in order to use the sample variance in the inference procedures discussed later. The concept of degrees of freedom will be revisited in a later chapter. Students interested in pursuing the matter further at this time should refer to the article by Walker (2).

When we compute the variance from a finite population of N values, the procedures outlined above are followed except that we subtract $\mu$ from each x and divide by N rather than $N - 1$. If we let $\sigma^2$ stand for the finite population variance, the formula is as follows:

$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} \tag{2.5.3}$$

Standard Deviation The variance represents squared units and, therefore, is not

an appropriate measure of dispersion when we wish to express this concept in terms of the

original units. To obtain a measure of dispersion in original units, we merely take the square

root of the variance. The result is called the standard deviation. In general, the standard

deviation of a sample is given by

$$s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} \tag{2.5.4}$$

The standard deviation of a finite population is obtained by taking the square root of the quantity obtained by Equation 2.5.3, and is represented by $\sigma$.
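The variance and standard deviation formulas above translate directly into code. The following Python sketch is illustrative only: the function names and the eight-value data set are assumptions made for easy arithmetic and do not come from the text. The last lines reuse the figures quoted in Example 2.5.2 (a sum of squared deviations of 810 divided by 9).

```python
import math
import statistics


def sample_variance(values):
    """Equation 2.5.2: squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two observations are needed")
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def population_variance(values):
    """Equation 2.5.3: squared deviations from the population mean, divided by N."""
    size = len(values)
    if size < 1:
        raise ValueError("at least one observation is needed")
    mu = sum(values) / size
    return sum((x - mu) ** 2 for x in values) / size


# Hypothetical observations chosen for easy arithmetic; they are not from the text.
data = [2, 4, 4, 4, 5, 5, 7, 9]

s2 = sample_variance(data)          # 32 / 7
sigma2 = population_variance(data)  # 32 / 8
s = math.sqrt(s2)                   # Equation 2.5.4
sigma = math.sqrt(sigma2)

# Cross-check against the standard library's sample variance.
library_s2 = statistics.variance(data)

# The deviations from the mean sum to zero, so only n - 1 of them are free to vary.
mean = sum(data) / len(data)
deviation_sum = sum(x - mean for x in data)

# Example 2.5.2: a sum of squared deviations of 810 with n - 1 = 9.
example_s2 = 810 / 9
example_s = math.sqrt(example_s2)

print(s2, sigma2, s, sigma)
print(deviation_sum, example_s2, round(example_s, 2))
```

Dividing by $n - 1$ instead of N makes the sample variance slightly larger than the population-style calculation on the same numbers, and the difference shrinks as n grows.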

The Coefficient of Variation The standard deviation is useful as a measure of

variation within a given set of data. When one desires to compare the dispersion in two sets

of data, however, comparing the two standard deviations may lead to fallacious results. It

may be that the two variables involved are measured in different units. For example, we

may wish to know, for a certain population, whether serum cholesterol levels, measured in

milligrams per 100 ml, are more variable than body weight, measured in pounds.

Furthermore, although the same unit of measurement is used, the two means may be

quite different. If we compare the standard deviation of weights of first-grade children with

the standard deviation of weights of high school freshmen, we may find that the latter

standard deviation is numerically larger than the former, because the weights themselves

are larger, not because the dispersion is greater.


What is needed in situations like these is a measure of relative variation rather than

absolute variation. Such a measure is found in the coefficient of variation, which expresses

the standard deviation as a percentage of the mean. The formula is given by

$$\text{C.V.} = \frac{s}{\bar{x}}(100)\% \tag{2.5.5}$$

We see that, since the mean and standard deviations are expressed in the same unit of

measurement, the unit of measurement cancels out in computing the coefficient of

variation. What we have, then, is a measure that is independent of the unit of measurement.

EXAMPLE 2.5.3

Suppose two samples of human males yield the following results:

                      Sample 1      Sample 2
Age                   25 years      11 years
Mean weight           145 pounds    80 pounds
Standard deviation    10 pounds     10 pounds

We wish to know which is more variable, the weights of the 25-year-olds or the weights of

the 11-year-olds.

Solution: A comparison of the standard deviations might lead one to conclude that the

two samples possess equal variability. If we compute the coefficients of

variation, however, we have for the 25-year-olds

$$\text{C.V.} = \frac{10}{145}(100) = 6.9\%$$

and for the 11-year-olds

$$\text{C.V.} = \frac{10}{80}(100) = 12.5\%$$

If we compare these results, we get quite a different impression. It is clear from this example that variation is much higher in the sample of 11-year-olds than in the sample of 25-year-olds.

The coefficient of variation is also useful in comparing the results obtained by

different persons who are conducting investigations involving the same variable. Since the

coefficient of variation is independent of the scale of measurement, it is a useful statistic for

comparing the variability of two or more variables measured on different scales. We could,

for example, use the coefficient of variation to compare the variability in weights of one

sample of subjects whose weights are expressed in pounds with the variability in weights of

another sample of subjects whose weights are expressed in kilograms.
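The calculations in Example 2.5.3, and the claim that the coefficient of variation does not depend on the unit of measurement, can be sketched in Python as follows. The function name is hypothetical, and the pounds-to-kilograms factor of 0.4536 is an approximate conversion introduced here purely for illustration.

```python
def coefficient_of_variation(std_dev, mean):
    """Equation 2.5.5: the standard deviation as a percentage of the mean."""
    if mean == 0:
        raise ValueError("the coefficient of variation is undefined for a zero mean")
    return std_dev / mean * 100


# Values from Example 2.5.3 (weights in pounds).
cv_25_year_olds = coefficient_of_variation(10, 145)
cv_11_year_olds = coefficient_of_variation(10, 80)

# Rescaling both the mean and the standard deviation (approximate pounds-to-kilograms
# factor, assumed for illustration) leaves the coefficient of variation unchanged.
LB_TO_KG = 0.4536
cv_25_in_kg = coefficient_of_variation(10 * LB_TO_KG, 145 * LB_TO_KG)

print(f"{cv_25_year_olds:.1f}% {cv_11_year_olds:.1f}% {cv_25_in_kg:.1f}%")
```

Because the conversion factor multiplies both numerator and denominator, it cancels, which is the sense in which the measure is independent of the scale of measurement.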


Computer Analysis Computer software packages provide a variety of possibilities in the calculation of descriptive measures. Figure 2.5.2 shows a printout of the descriptive measures available from the MINITAB package. The data consist of the ages from Example 2.4.2.

In the printout $Q_1$ and $Q_3$ are the first and third quartiles, respectively. These measures are described later in this chapter. N stands for the number of data observations, and N* stands for the number of missing values. The term SE Mean stands for standard error of the mean. This measure will be discussed in detail in a later chapter. Figure 2.5.3 shows, for the same data, the SAS® printout obtained by using the PROC MEANS statement.

Percentiles and Quartiles The mean and median are special cases of a family

of parameters known as location parameters. These descriptive measures are called

location parameters because they can be used to designate certain positions on the

horizontal axis when the distribution of a variable is graphed. In that sense the so-called

location parameters “locate” the distribution on the horizontal axis. For example, a

distribution with a median of 100 is located to the right of a distribution with a median

of 50 when the two distributions are graphed. Other location parameters include percentiles

and quartiles. We may define a percentile as follows:

DEFINITION

Given a set of n observations $x_1, x_2, \ldots, x_n$, the pth percentile P is the value of X such that p percent or less of the observations are less than P and $(100 - p)$ percent or less of the observations are greater than P.

Variable N N* Mean SE Mean StDev Minimum Q1 Median Q3 Maximum

C1 10 0 56.00 3.00 9.49 38.00 48.25 58.00 64.25 66.00

FIGURE 2.5.2 Printout of descriptive measures computed from the sample of ages in

Example 2.4.2, MINITAB software package.

The MEANS Procedure

Analysis Variable: Age

N     Mean          Std Dev      Minimum       Maximum
10    56.0000000    9.4868330    38.0000000    66.0000000

Std Error    Sum            Variance      Coeff of Variation
3.0000000    560.0000000    90.0000000    16.9407732

FIGURE 2.5.3 Printout of descriptive measures computed from the sample of ages in Example 2.4.2, SAS® software package.


Subscripts on P serve to distinguish one percentile from another. The 10th percentile, for example, is designated $P_{10}$, the 70th is designated $P_{70}$, and so on. The 50th percentile is the median and is designated $P_{50}$. The 25th percentile is often referred to as the first quartile and denoted $Q_1$. The 50th percentile (the median) is referred to as the second or middle quartile and written $Q_2$, and the 75th percentile is referred to as the third quartile, $Q_3$.

When we wish to find the quartiles for a set of data, the following formulas are used:

$$
\begin{aligned}
Q_1 &= \frac{n + 1}{4}\text{th ordered observation} \\
Q_2 &= \frac{2(n + 1)}{4} = \frac{n + 1}{2}\text{th ordered observation} \\
Q_3 &= \frac{3(n + 1)}{4}\text{th ordered observation}
\end{aligned}
\tag{2.5.6}
$$

It should be noted that the equations shown in 2.5.6 determine the positions of the quartiles in a data set, not the values of the quartiles. It should also be noted that though there is a universal way to calculate the median ($Q_2$), there are a variety of ways to calculate $Q_1$ and $Q_3$ values. For example, SAS provides for a total of five different ways to calculate the quartile values, and other programs implement still other methods. For a discussion of the various methods for calculating quartiles, interested readers are referred to the article by Hyndman and Fan (3). To illustrate, note that the printout in MINITAB in Figure 2.5.2 shows $Q_1 = 48.25$ and $Q_3 = 64.25$, whereas program R yields the values $Q_1 = 52.75$ and $Q_3 = 63.25$.

Interquartile Range As we have seen, the range provides a crude measure of

the variability present in a set of data. A disadvantage of the range is the fact that it is

computed from only two values, the largest and the smallest. A similar measure that

reflects the variability among the middle 50 percent of the observations in a data set is

the interquartile range.

DEFINITION

The interquartile range (IQR) is the difference between the third and first

quartiles: that is,

$$\text{IQR} = Q_3 - Q_1 \tag{2.5.7}$$

A large IQR indicates a large amount of variability among the middle 50 percent of the

relevant observations, and a small IQR indicates a small amount of variability among the

relevant observations. Since such statements are rather vague, it is more informative to

compare the interquartile range with the range for the entire data set. A comparison may

be made by forming the ratio of the IQR to the range (R) and multiplying by 100. That is,

100 (IQR/R) tells us what percent the IQR is of the overall range.

Kurtosis Just as we may describe a distribution in terms of skewness, we may

describe a distribution in terms of kurtosis.


DEFINITION

Kurtosis is a measure of the degree to which a distribution is “peaked” or

flat in comparison to a normal distribution whose graph is characterized

by a bell-shaped appearance.

A distribution, in comparison to a normal distribution, may possess an excessive

proportion of observations in its tails, so that its graph exhibits a flattened appearance.

Such a distribution is said to be platykurtic. Conversely, a distribution, in comparison to a

normal distribution, may possess a smaller proportion of observations in its tails, so that its

graph exhibits a more peaked appearance. Such a distribution is said to be leptokurtic. A

normal, or bell-shaped distribution, is said to be mesokurtic.

Kurtosis can be expressed as

$$\text{Kurtosis} = \frac{n \sum_{i=1}^{n} (x_i - \bar{x})^4}{\left( \sum_{i=1}^{n} (x_i - \bar{x})^2 \right)^2} - 3 = \frac{n \sum_{i=1}^{n} (x_i - \bar{x})^4}{(n - 1)^2 s^4} - 3 \tag{2.5.8}$$

Manual calculation using Equation 2.5.8 is usually not necessary, since most statistical

packages calculate and report information regarding kurtosis as part of the descriptive

statistics for a data set. Note that each of the two parts of Equation 2.5.8 has been reduced by 3. Before this reduction, a perfectly mesokurtic distribution has a kurtosis measure of 3.

Most computer algorithms reduce the measure by 3, as is done in Equation 2.5.8, so that the

kurtosis measure of a mesokurtic distribution will be equal to 0. A leptokurtic distribution,

then, will have a kurtosis measure > 0, and a platykurtic distribution will have a kurtosis

measure < 0. Be aware that not all computer packages make this adjustment. In such cases,

comparisons with a mesokurtic distribution are made against 3 instead of against 0. Graphs

of distributions representing the three types of kurtosis are shown in Figure 2.5.4.
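For readers who want to see Equation 2.5.8 at work, the following Python sketch evaluates both forms of the formula and confirms that they agree. The function names and the five-value data set are hypothetical. A given statistical package may use a different kurtosis estimator, so its output need not match this calculation exactly.

```python
def kurtosis_from_sums(values):
    """First form of Equation 2.5.8: n * sum(d^4) / (sum(d^2))^2 - 3."""
    n = len(values)
    mean = sum(values) / n
    sum_sq = sum((x - mean) ** 2 for x in values)
    sum_fourth = sum((x - mean) ** 4 for x in values)
    if sum_sq == 0:
        raise ValueError("kurtosis is undefined when all values are equal")
    return n * sum_fourth / sum_sq ** 2 - 3


def kurtosis_from_s(values):
    """Second form of Equation 2.5.8: n * sum(d^4) / ((n - 1)^2 * s^4) - 3."""
    n = len(values)
    mean = sum(values) / n
    s2 = sum((x - mean) ** 2 for x in values) / (n - 1)  # sample variance
    sum_fourth = sum((x - mean) ** 4 for x in values)
    if s2 == 0:
        raise ValueError("kurtosis is undefined when all values are equal")
    return n * sum_fourth / ((n - 1) ** 2 * s2 ** 2) - 3


# Hypothetical, evenly spread values: the sum of squared deviations is 10
# and the sum of fourth powers is 34, so the measure is 5(34)/100 - 3 = -1.3.
flat_data = [1, 2, 3, 4, 5]

k_sums = kurtosis_from_sums(flat_data)
k_s = kurtosis_from_s(flat_data)

print(round(k_sums, 4), round(k_s, 4))
```

The negative result is consistent with the text: an evenly spread, flat set of values yields a kurtosis measure below 0 after the reduction by 3.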

EXAMPLE 2.5.4

Consider the three distributions shown in Figure 2.5.4. Given that the histograms represent

frequency counts, the data can be easily re-created and entered into a statistical package.

For example, observation of the “mesokurtic” distribution would yield the following data:

1, 2, 2, 3, 3, 3, 3, 3, . . . , 9, 9, 9, 9, 9, 10, 10, 11. Values can be obtained from the other

distributions in a similar fashion. Using SPSS software, the following descriptive statistics

were obtained for these three distributions:

            Mesokurtic    Leptokurtic    Platykurtic
Mean        6.0000        6.0000         6.0000
Median      6.0000        6.0000         6.0000
Mode        6.00          6.00           6.00
Kurtosis    .000          .608           −1.158


Box-and-Whisker Plots A useful visual device for communicating the information contained in a data set is the box-and-whisker plot. The construction of a box-and-whisker plot (sometimes called, simply, a boxplot) makes use of the quartiles of a data set and may be accomplished by following these five steps:

1. Represent the variable of interest on the horizontal axis.

2. Draw a box in the space above the horizontal axis in such a way that the left end of the box aligns with the first quartile $Q_1$ and the right end of the box aligns with the third quartile $Q_3$.

3. Divide the box into two parts by a vertical line that aligns with the median $Q_2$.

4. Draw a horizontal line called a whisker from the left end of the box to a point that aligns with the smallest measurement in the data set.

5. Draw another horizontal line, or whisker, from the right end of the box to a point that aligns with the largest measurement in the data set.

Examination of a box-and-whisker plot for a set of data reveals information

regarding the amount of spread, location of concentration, and symmetry of the data.

The following example illustrates the construction of a box-and-whisker plot.

EXAMPLE 2.5.5

Evans et al. (A-7) examined the effect of velocity on ground reaction forces (GRF) in

dogs with lameness from a torn cranial cruciate ligament. The dogs were walked and

trotted over a force platform, and the GRF was recorded during a certain phase of their

performance. Table 2.5.1 contains 20 measurements of force where each value shown is

the mean of five force measurements per dog when trotting.

FIGURE 2.5.4 Three histograms representing kurtosis.

TABLE 2.5.1 GRF Measurements When Trotting of 20 Dogs with a Lame

Ligament

14.6 24.3 24.9 27.0 27.2 27.4 28.2 28.8 29.9 30.7

31.5 31.6 32.3 32.8 33.3 33.6 34.3 36.9 38.3 44.0

Source: Data provided courtesy of Richard Evans, Ph.D.


Solution: The smallest and largest measurements are 14.6 and 44, respectively. The first quartile is the $Q_1 = (20 + 1)/4 = 5.25$th measurement, which is $27.2 + (.25)(27.4 - 27.2) = 27.25$. The median is the $Q_2 = (20 + 1)/2 = 10.5$th measurement, or $30.7 + (.5)(31.5 - 30.7) = 31.1$; and the third quartile is the $Q_3 = 3(20 + 1)/4 = 15.75$th measurement, which is equal to $33.3 + (.75)(33.6 - 33.3) = 33.525$. The interquartile range is $\text{IQR} = 33.525 - 27.25 = 6.275$. The range is 29.4, and the IQR is $100(6.275/29.4) = 21$ percent of the range. The resulting box-and-whisker plot is shown in Figure 2.5.5.
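The arithmetic in this solution can be reproduced with a short Python sketch. It applies the position rule of Equation 2.5.6 with linear interpolation between neighboring ordered observations, using the data of Table 2.5.1. The helper names are hypothetical. For comparison, the sketch also calls the standard library function `statistics.quantiles` with `method="exclusive"`, which uses the same $(n + 1)$-based positions.

```python
import math
import statistics


def ordered_value(sorted_values, position):
    """Value at a 1-based, possibly fractional, position in an ordered array."""
    lower = math.floor(position)
    fraction = position - lower
    if fraction == 0:
        return sorted_values[lower - 1]
    below, above = sorted_values[lower - 1], sorted_values[lower]
    return below + fraction * (above - below)


def quartiles(values):
    """Return (Q1, Q2, Q3) using the positions k(n + 1)/4 of Equation 2.5.6."""
    xs = sorted(values)
    n = len(xs)
    if n < 3:
        raise ValueError("this sketch needs at least three observations")
    return tuple(ordered_value(xs, k * (n + 1) / 4) for k in (1, 2, 3))


# GRF measurements from Table 2.5.1.
grf = [14.6, 24.3, 24.9, 27.0, 27.2, 27.4, 28.2, 28.8, 29.9, 30.7,
       31.5, 31.6, 32.3, 32.8, 33.3, 33.6, 34.3, 36.9, 38.3, 44.0]

q1, q2, q3 = quartiles(grf)
iqr = q3 - q1
data_range = max(grf) - min(grf)
iqr_percent_of_range = 100 * iqr / data_range

# Standard-library cross-check using the same (n + 1)-based positions.
library_quartiles = statistics.quantiles(grf, n=4, method="exclusive")

print(round(q1, 3), round(q2, 3), round(q3, 3))
print(round(iqr, 3), round(data_range, 1), round(iqr_percent_of_range))
```

Other interpolation rules, such as the alternatives offered by SAS or R, would generally give slightly different values for $Q_1$ and $Q_3$ on the same data.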

Examination of Figure 2.5.5 reveals that 50 percent of the measurements are between

about 27 and 33, the approximate values of the first and third quartiles, respectively. The

vertical bar inside the box shows that the median is about 31.

Many statistical software packages have the capability of constructing box-and-whisker plots. Figure 2.5.6 shows one constructed by MINITAB and one constructed by NCSS from the data of Table 2.5.1. The procedure to produce the MINITAB plot is shown in Figure 2.5.7. The asterisks in Figure 2.5.6 alert us to the fact that the data set contains one unusually large and one unusually small value, called outliers. The outliers are the dogs that generated forces of 14.6 and 44. Figure 2.5.6 illustrates the fact that box-and-whisker plots may be displayed vertically as well as horizontally.

An outlier, or atypical observation, may be defined as follows.

FIGURE 2.5.5 Box-and-whisker plot for Example 2.5.5. (Horizontal axis: GRF Measurements, 14 to 46.)

FIGURE 2.5.6 Box-and-whisker plot constructed by MINITAB (left) and by R (right) from the data of Table 2.5.1. (Vertical axis: Force, 15 to 45; asterisks mark the outliers.)


DEFINITION

An outlier is an observation whose value, x, either exceeds the value of the third quartile by a magnitude greater than 1.5(IQR) or is less than the value of the first quartile by a magnitude greater than 1.5(IQR). That is, an observation of $x > Q_3 + 1.5(\text{IQR})$ or an observation of $x < Q_1 - 1.5(\text{IQR})$ is called an outlier.

For the data in Table 2.5.1 we may use the previously computed values of $Q_1$, $Q_3$, and IQR to determine how large or how small a value would have to be in order to be considered an outlier. The calculations are as follows:

$$x < 27.25 - 1.5(6.275) = 17.8375 \quad \text{and} \quad x > 33.525 + 1.5(6.275) = 42.9375$$

For the data in Table 2.5.1, then, an observed value smaller than 17.8375 or larger than

42.9375 would be considered an outlier.
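The outlier rule can be applied mechanically, as in the following Python sketch. It takes the quartile values computed in Example 2.5.5 as given and screens the data of Table 2.5.1 against the two cutoffs. The function name `outlier_fences` is hypothetical.

```python
def outlier_fences(q1, q3):
    """Return the cutoffs (Q1 - 1.5 IQR, Q3 + 1.5 IQR) from the outlier definition."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


# GRF measurements from Table 2.5.1.
grf = [14.6, 24.3, 24.9, 27.0, 27.2, 27.4, 28.2, 28.8, 29.9, 30.7,
       31.5, 31.6, 32.3, 32.8, 33.3, 33.6, 34.3, 36.9, 38.3, 44.0]

# Quartiles as computed in Example 2.5.5.
lower_fence, upper_fence = outlier_fences(27.25, 33.525)

# An observation is an outlier if it falls strictly outside the fences.
outliers = [x for x in grf if x < lower_fence or x > upper_fence]

print(round(lower_fence, 4), round(upper_fence, 4), outliers)
```

The two values flagged are 14.6 and 44.0, the same observations marked by asterisks in Figure 2.5.6.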

The SAS® statement PROC UNIVARIATE may be used to obtain a box-and-whisker plot. The statement also produces other descriptive measures and displays, including stem-and-leaf plots, means, variances, and quartiles.

Exploratory Data Analysis Box-and-whisker plots and stem-and-leaf displays are examples of what are known as exploratory data analysis techniques. These techniques, made popular as a result of the work of Tukey (4), allow the investigator to examine data in ways that reveal trends and relationships, identify unique features of data sets, and facilitate their description and summarization.

EXERCISES

For each of the data sets in the following exercises compute (a) the mean, (b) the median, (c) the

mode, (d) the range, (e) the variance, (f) the standard deviation, (g) the coefficient of variation, and (h)

the interquartile range. Treat each data set as a sample. For those exercises for which you think it

would be appropriate, construct a box-and-whisker plot and discuss the usefulness in understanding

the nature of the data that this device provides. For each exercise select the measure of central

tendency that you think would be most appropriate for describing the data. Give reasons to justify

your choice.

Dialog box:

Stat > EDA > Boxplot > Simple
Click OK.
Type Force in Graph Variables.
Click OK.

Session command:

MTB > Boxplot 'Force';
SUBC> IQRbox;
SUBC> Outlier.

FIGURE 2.5.7 MINITAB procedure to produce Figure 2.5.6.


2.5.1 Porcellini et al. (A-8) studied 13 HIV-positive patients who were treated with highly active antiretroviral therapy (HAART) for at least 6 months. The CD4 T cell counts $(\times 10^6/\text{L})$ at baseline for the 13 subjects are listed below.

230 205 313 207 227 245 173

58 103 181 105 301 169

Source: Simona Porcellini, Guiliana Vallanti, Silvia Nozza,

Guido Poli, Adriano Lazzarin, Guiseppe Tambussi,

Antonio Grassia, “Improved Thymopoietic Potential in

Aviremic HIV Infected Individuals with HAART by

Intermittent IL-2 Administration,” AIDS, 17 (2003),

1621–1630.

2.5.2 Shair and Jasper (A-9) investigated whether decreasing the venous return in young rats would affect

ultrasonic vocalizations (USVs). Their research showed no significant change in the number of

ultrasonic vocalizations when blood was removed from either the superior vena cava or the carotid

artery. Another important variable measured was the heart rate (bpm) during the withdrawal of blood.

The table below presents the heart rate of seven rat pups from the experiment involving the carotid

artery.

500 570 560 570 450 560 570

Source: Harry N. Shair and Anna Jasper, “Decreased

Venous Return Is Neither Sufficient nor Necessary to Elicit

Ultrasonic Vocalization of Infant Rat Pups,” Behavioral

Neuroscience, 117 (2003), 840–853.

2.5.3 Butz et al. (A-10) evaluated the duration of benefit derived from the use of noninvasive positive-

pressure ventilation by patients with amyotrophic lateral sclerosis on symptoms, quality of life, and

survival. One of the variables of interest is partial pressure of arterial carbon dioxide ($\text{PaCO}_2$). The

values below (mm Hg) reflect the result of baseline testing on 30 subjects as established by arterial

blood gas analyses.

40.0 47.0 34.0 42.0 54.0 48.0 53.6 56.9 58.0 45.0

54.5 54.0 43.0 44.3 53.9 41.8 33.0 43.1 52.4 37.9

34.5 40.1 33.0 59.9 62.6 54.1 45.7 40.6 56.6 59.0

Source: M. Butz, K. H. Wollinsky, U. Widemuth-Catrinescu, A. Sperfeld,

S. Winter, H. H. Mehrkens, A. C. Ludolph, and H. Schreiber, “Longitudinal Effects

of Noninvasive Positive-Pressure Ventilation in Patients with Amyotrophic Lateral

Sclerosis,” American Journal of Medical Rehabilitation, 82 (2003), 597–604.

2.5.4 According to Starch et al. (A-11), hamstring tendon grafts have been the “weak link” in anterior

cruciate ligament reconstruction. In a controlled laboratory study, they compared two techniques for

reconstruction: either an interference screw or a central sleeve and screw on the tibial side. For eight

cadaveric knees, the measurements below represent the required force (in newtons) at which initial

failure of graft strands occurred for the central sleeve and screw technique.

172.5 216.63 212.62 98.97 66.95 239.76 19.57 195.72

Source: David W. Starch, Jerry W. Alexander, Philip C. Noble, Suraj Reddy, and David M.

Lintner, “Multistranded Hamstring Tendon Graft Fixation with a Central Four-Quadrant or

a Standard Tibial Interference Screw for Anterior Cruciate Ligament Reconstruction,” The

American Journal of Sports Medicine, 31 (2003), 338–344.


2.5.5 Cardosi et al. (A-12) performed a 4-year retrospective review of 102 women undergoing radical

hysterectomy for cervical or endometrial cancer. Catheter-associated urinary tract infection was

observed in 12 of the subjects. Below are the numbers of postoperative days until diagnosis of the

infection for each subject experiencing an infection.

16 10 49 15 6 15

8 19 11 22 13 17

Source: Richard J. Cardosi, Rosemary Cardosi, Edward

C. Grendys Jr., James V. Fiorica, and Mitchel S. Hoffman,

“Infectious Urinary Tract Morbidity with Prolonged

Bladder Catheterization After Radical Hysterectomy,” American

Journal of Obstetrics and Gynecology,

189 (2003), 380–384.

2.5.6 The purpose of a study by Nozawa et al. (A-13) was to evaluate the outcome of surgical repair of pars

interarticularis defect by segmental wire fixation in young adults with lumbar spondylolysis. The

authors found that segmental wire fixation historically has been successful in the treatment of

nonathletes with spondylolysis, but no information existed on the results of this type of surgery in

athletes. In a retrospective study, the authors found 20 subjects who had the surgery between 1993 and

2000. For these subjects, the data below represent the duration in months of follow-up care after the

operation.

103 68 62 60 60 54 49 44 42 41

38 36 34 30 19 19 19 19 17 16

Source: Satoshi Nozawa, Katsuji Shimizu, Kei Miyamoto, and

Mizuo Tanaka, “Repair of Pars Interarticularis Defect

by Segmental Wire Fixation in Young Athletes with

Spondylolysis,” American Journal of Sports Medicine, 31 (2003),

359–364.

2.5.7 See Exercise 2.3.1.

2.5.8 See Exercise 2.3.2.

2.5.9 See Exercise 2.3.3.

2.5.10 See Exercise 2.3.4.

2.5.11 See Exercise 2.3.5.

2.5.12 See Exercise 2.3.6.

2.5.13 See Exercise 2.3.7.

2.5.14 In a pilot study, Huizinga et al. (A-14) wanted to gain more insight into the psychosocial

consequences for children of a parent with cancer. For the study, 14 families participated in

semistructured interviews and completed standardized questionnaires. Below is the age of the

sick parent with cancer (in years) for the 14 families.

37 48 53 46 42 49 44

38 32 32 51 51 48 41

Source: Gea A. Huizinga, Winette T.A. van der Graaf, Annemike

Visser, Jos S. Dijkstra, and Josette E. H. M. Hoekstra-Weebers, “Psychosocial

Consequences for Children of a Parent with Cancer,” Cancer Nursing, 26

(2003), 195–202.


2.6 SUMMARY

In this chapter various descriptive statistical procedures are explained. These include the

organization of data by means of the ordered array, the frequency distribution, the relative

frequency distribution, the histogram, and the frequency polygon. The concepts of

central tendency and variation are described, along with methods for computing their

more common measures: the mean, median, mode, range, variance, and standard

deviation. The reader is also introduced to the concepts of skewness and kurtosis,

and to exploratory data analysis through a description of stem-and-leaf displays and box-

and-whisker plots.

We emphasize the use of the computer as a tool for calculating descriptive measures

and constructing various distributions from large data sets.

SUMMARY OF FORMULAS FOR CHAPTER 2

Formula Number, Name, Formula

2.3.1 Class interval width using Sturges’s Rule

$$w = \frac{R}{k}$$

2.4.1 Mean of a population

$$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$$

2.4.2 Skewness

$$\text{Skewness} = \frac{\sqrt{n} \sum_{i=1}^{n} (x_i - \bar{x})^3}{\left( \sum_{i=1}^{n} (x_i - \bar{x})^2 \right)^{3/2}} = \frac{\sqrt{n} \sum_{i=1}^{n} (x_i - \bar{x})^3}{(n - 1)\sqrt{n - 1}\, s^3}$$

2.4.2 Mean of a sample

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

2.5.1 Range

$$R = x_L - x_S$$

2.5.2 Sample variance

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

2.5.3 Population variance

$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$

2.5.4 Standard deviation

$$s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$

2.5.5 Coefficient of variation

$$\text{C.V.} = \frac{s}{\bar{x}}(100)\%$$

2.5.6 Quartile location in ordered array

$$Q_1 = \frac{1}{4}(n + 1), \qquad Q_2 = \frac{1}{2}(n + 1), \qquad Q_3 = \frac{3}{4}(n + 1)$$

2.5.7 Interquartile range

$$\text{IQR} = Q_3 - Q_1$$

2.5.8 Kurtosis

$$\text{Kurtosis} = \frac{n \sum_{i=1}^{n} (x_i - \bar{x})^4}{\left( \sum_{i=1}^{n} (x_i - \bar{x})^2 \right)^2} - 3 = \frac{n \sum_{i=1}^{n} (x_i - \bar{x})^4}{(n - 1)^2 s^4} - 3$$

Symbol Key

• C.V. = coefficient of variation
• IQR = interquartile range
• k = number of class intervals
• $\mu$ = population mean
• N = population size
• n = sample size
• $(n - 1)$ = degrees of freedom
• $Q_1$ = first quartile
• $Q_2$ = second quartile = median
• $Q_3$ = third quartile
• R = range
• s = standard deviation
• $s^2$ = sample variance
• $\sigma^2$ = population variance
• $x_i$ = ith data observation
• $x_L$ = largest data point
• $x_S$ = smallest data point
• $\bar{x}$ = sample mean
• w = class width


REVIEW QUESTIONS AND EXERCISES

1. Define:

(a) Stem-and-leaf display (b) Box-and-whisker plot

(c) Percentile (d) Quartile

(e) Location parameter (f) Exploratory data analysis

(g) Ordered array (h) Frequency distribution

(i) Relative frequency distribution (j) Statistic

(k) Parameter (l) Frequency polygon

(m) True class limits (n) Histogram

2. Define and compare the characteristics of the mean, the median, and the mode.

3. What are the advantages and limitations of the range as a measure of dispersion?

4. Explain the rationale for using $n - 1$ to compute the sample variance.

5. What is the purpose of the coefficient of variation?

6. What is the purpose of Sturges’s rule?

7. What is another name for the 50th percentile (second or middle quartile)?

8. Describe from your field of study a population of data where knowledge of the central tendency and

dispersion would be useful. Obtain real or realistic synthetic values from this population and compute

the mean, median, mode, variance, and standard deviation.

9. Collect a set of real, or realistic, data from your field of study and construct a frequency distribution, a

relative frequency distribution, a histogram, and a frequency polygon.

10. Compute the mean, median, mode, variance, and standard deviation for the data in Exercise 9.

11. Find an article in a journal from your field of study in which some measure of central tendency and

dispersion have been computed.

12. The purpose of a study by Tam et al. (A-15) was to investigate the wheelchair maneuvering in

individuals with lower-level spinal cord injury (SCI) and healthy controls. Subjects used a modified

wheelchair to incorporate a rigid seat surface to facilitate the specified experimental measurements.

Interface pressure measurement was recorded by using a high-resolution pressure-sensitive mat with

a spatial resolution of 4 sensors per square centimeter taped on the rigid seat support. During static

sitting conditions, average pressures were recorded under the ischial tuberosities. The data for

measurements of the left ischial tuberosity (in mm Hg) for the SCI and control groups are shown

below.

Control 131 115 124 131 122 117 88 114 150 169

SCI 60 150 130 180 163 130 121 119 130 148

Source: Eric W. Tam, Arthur F. Mak, Wai Nga Lam, John H. Evans, and York Y. Chow, “Pelvic Movement and Interface Pressure Distribution During Manual Wheelchair Propulsion,” Archives of Physical Medicine and Rehabilitation, 84 (2003), 1466–1472.

(a) Find the mean, median, variance, and standard deviation for the controls.

(b) Find the mean, median, variance, and standard deviation for the SCI group.


(c) Construct a box-and-whisker plot for the controls.

(d) Construct a box-and-whisker plot for the SCI group.

(e) Do you believe there is a difference in pressure readings for controls and SCI subjects in this

study?

13. Johnson et al. (A-16) performed a retrospective review of 50 fetuses that underwent open fetal

myelomeningocele closure. The data below show the gestational age in weeks of the 50 fetuses

undergoing the procedure.

25 25 26 27 29 29 29 30 30 31

32 32 32 33 33 33 33 34 34 34

35 35 35 35 35 35 35 35 35 36

36 36 36 36 36 36 36 36 36 36

36 36 36 36 36 36 36 36 37 37

Source: Mark P. Johnson, Leslie N. Sutton, Natalie Rintoul, Timothy M. Crombleholme, Alan W. Flake, Lori J. Howell, Holly L. Hedrick, R. Douglas Wilson, and N. Scott Adzick, “Fetal Myelomeningocele Repair: Short-Term Clinical Outcomes,” American Journal of Obstetrics and Gynecology, 189 (2003), 482–487.

(a) Construct a stem-and-leaf plot for these gestational ages.

(b) Based on the stem-and-leaf plot, what one word would you use to describe the nature of the data?

(c) Why do you think the stem-and-leaf plot looks the way it does?

(d) Compute the mean, median, variance, and standard deviation.

14. The following table gives the age distribution for the number of deaths in New York State due to

accidents for residents age 25 and older.

Age (Years)    Number of Deaths Due to Accidents
25–34          393
35–44          514
45–54          460
55–64          341
65–74          365
75–84          616
85–94+         618

Source: New York State Department of Health, Vital Statistics of New York State, 2000, Table 32: Death Summary Information by Age.

+ May include deaths due to accident for adults over age 94.

For these data construct a cumulative frequency distribution, a relative frequency distribution, and a

cumulative relative frequency distribution.

15. Krieser et al. (A-17) examined glomerular filtration rate (GFR) in pediatric renal transplant

recipients. GFR is an important parameter of renal function assessed in renal transplant recipients.

The following are measurements from 19 subjects of GFR measured with diethylenetriamine pentaacetic acid. (Note: Some subjects were measured more than once.)

18 42

21 43

21 43

23 48

27 48

27 51

30 55

32 58

32 60

32 62

36 67

37 68

41 88

42 63

Source: Data provided courtesy of D. M. Z. Krieser, M.D.

(a) Compute mean, median, variance, standard deviation, and coefficient of variation.

(b) Construct a stem-and-leaf display.

(c) Construct a box-and-whisker plot.

(d) What percentage of the measurements is within one standard deviation of the mean? Two

standard deviations? Three standard deviations?

16. The following are the cystatin C levels (mg/L) for the patients described in Exercise 15 (A-17).

Cystatin C is a cationic basic protein that was investigated for its relationship to GFR levels. In

addition, creatinine levels are also given. (Note: Some subjects were measured more than once.)

Cystatin C (mg/L) Creatinine (mmol/L)

1.78 4.69 0.35 0.14

2.16 3.78 0.30 0.11

1.82 2.24 0.20 0.09

1.86 4.93 0.17 0.12

1.75 2.71 0.15 0.07

1.83 1.76 0.13 0.12

2.49 2.62 0.14 0.11

1.69 2.61 0.12 0.07

1.85 3.65 0.24 0.10

1.76 2.36 0.16 0.13

1.25 3.25 0.17 0.09

1.50 2.01 0.11 0.12

2.06 2.51 0.12 0.06

2.34

Source: Data provided courtesy of D. M. Z. Krieser, M.D.

(a) For each variable, compute the mean, median, variance, standard deviation, and coefficient of

variation.

(b) For each variable, construct a stem-and-leaf display and a box-and-whisker plot.

(c) Which set of measurements is more variable, cystatin C or creatinine? On what do you base your

answer?

17. Give three synonyms for variation (variability).

18. The following table shows the age distribution of live births in Albany County, New York, for

2000.

Mother’s Age    Number of Live Births
10–14           7
15–19           258
20–24           585
25–29           841
30–34           981
35–39           526
40–44           99
45–49+          4

Source: New York State Department of Health, Annual Vital Statistics 2000, Table 7, Live Births by Resident County and Mother’s Age.

+ May include live births to mothers over age 49.

For these data construct a cumulative frequency distribution, a relative frequency distribution, and a

cumulative relative frequency distribution.

19. Spivack et al. (A-18) investigated the severity of disease associated with C. difficile in pediatric inpatients. One of the variables they examined was the number of days patients experienced diarrhea. The data for the 22 subjects in the study appear below. Compute the mean, median, variance, and standard deviation.

3 11 3 4 14 2 4 5 3 11 2

2 3 2 1 1 7 2 1 1 3 2

Source: Jordan G. Spivack, Stephen C. Eppes, and Joel D. Klien,

“Clostridium Difficile–Associated Diarrhea in a Pediatric

Hospital,” Clinical Pediatrics, 42 (2003), 347–352.

20. Express in words the following properties of the sample mean:

(a) $\sum (x - \bar{x})^2 = \text{a minimum}$

(b) $n\bar{x} = \sum x$

(c) $\sum (x - \bar{x}) = 0$

21. Your statistics instructor tells you on the first day of class that there will be five tests during the term.

From the scores on these tests for each student, the instructor will compute a measure of central

tendency that will serve as the student’s final course grade. Before taking the first test, you must

choose whether you want your final grade to be the mean or the median of the five test scores. Which

would you choose? Why?

22. Consider the following possible class intervals for use in constructing a frequency distribution of

serum cholesterol levels of subjects who participated in a mass screening:

(a)          (b)          (c)
50–74        50–74        50–75
75–99        75–99        75–100
100–149      100–124      100–125
150–174      125–149      125–150
175–199      150–174      150–175
200–249      175–199      175–200
250–274      200–224      200–225
etc.         225–249      225–250
             etc.         etc.

Which set of class intervals do you think is most appropriate for the purpose? Why? State specifically

for each one why you think the other two are less desirable.

23. On a statistics test students were asked to construct a frequency distribution of the blood creatine

levels (units/liter) for a sample of 300 healthy subjects. The mean was 95, and the standard deviation

was 40. The following class interval widths were used by the students:

(a) 1 (d) 15

(b) 5 (e) 20

(c) 10 (f) 25

Comment on the appropriateness of these choices of widths.

24. Give a health sciences-related example of a population of measurements for which the mean would

be a better measure of central tendency than the median.

25. Give a health sciences-related example of a population of measurements for which the median would

be a better measure of central tendency than the mean.

26. Indicate for the following variables which you think would be a better measure of central tendency,

the mean, the median, or mode, and justify your choice:

(a) Annual incomes of licensed practical nurses in the Southeast.

(b) Diagnoses of patients seen in the emergency department of a large city hospital.

(c) Weights of high-school male basketball players.

27. Refer to Exercise 2.3.11. Compute the mean, median, variance, standard deviation, first quartile, third

quartile, and interquartile range. Construct a boxplot of the data. Are the mode, median, and mean

equal? If not, explain why. Discuss the data in terms of variability. Compare the IQR with the range.

What does the comparison tell you about the variability of the observations?

28. Refer to Exercise 2.3.12. Compute the mean, median, variance, standard deviation, first quartile, third

quartile, and interquartile range. Construct a boxplot of the data. Are the mode, median, and mean

equal? If not, explain why. Discuss the data in terms of variability. Compare the IQR with the range.

What does the comparison tell you about the variability of the observations?

29. Thilothammal et al. (A-19) designed a study to determine the efficacy of BCG (bacillus

Calmette-Guerin) vaccine in preventing tuberculous meningitis. Among the data collected on

each subject was a measure of nutritional status (actual weight expressed as a percentage of

expected weight for actual height). The following table shows the nutritional status values of the

107 cases studied.

73.3 54.6 82.4 76.5 72.2 73.6 74.0

80.5 71.0 56.8 80.6 100.0 79.6 67.3

50.4 66.0 83.0 72.3 55.7 64.1 66.3

50.9 71.0 76.5 99.6 79.3 76.9 96.0

64.8 74.0 72.6 80.7 109.0 68.6 73.8

74.0 72.7 65.9 73.3 84.4 73.2 70.0

72.8 73.6 70.0 77.4 76.4 66.3 50.5

72.0 97.5 130.0 68.1 86.4 70.0 73.0

59.7 89.6 76.9 74.6 67.7 91.9 55.0

90.9 70.5 88.2 70.5 74.0 55.5 80.0

76.9 78.1 63.4 58.8 92.3 100.0 84.0

71.4 84.6 123.7 93.7 76.9 79.6

45.6 92.5 65.6 61.3 64.5 72.7

77.5 76.9 80.2 76.9 88.7 78.1

60.6 59.0 84.7 78.2 72.4 68.3

67.5 76.9 82.6 85.4 65.7 65.9

Source: Data provided courtesy of Dr. N. Thilothammal.

(a) For these data compute the following descriptive measures: mean, median, mode, variance,

standard deviation, range, first quartile, third quartile, and IQR.

(b) Construct the following graphs for the data: histogram, frequency polygon, stem-and-leaf plot,

and boxplot.

(c) Discuss the data in terms of variability. Compare the IQR with the range. What does the

comparison tell you about the variability of the observations?

(d) What proportion of the measurements are within one standard deviation of the mean? Two

standard deviations of the mean? Three standard deviations of the mean?

(e) What proportion of the measurements are less than 100?

(f) What proportion of the measurements are less than 50?

Exercises for Use with Large Data Sets Available on the Following Website: www.wiley.com/college/daniel

1. Refer to the dataset NCBIRTH800. The North Carolina State Center for Health Statistics and

Howard W. Odum Institute for Research in Social Science at the University of North Carolina at

Chapel Hill (A-20) make publicly available birth and infant death data for all children born in the

state of North Carolina. These data can be accessed at www.irss.unc.edu/ncvital/bfd1down.html.

Records on birth data go back to 1968. This comprehensive data set for the births in 2001 contains

120,300 records. The data set NCBIRTH800 represents a random sample of 800 of those births and selected variables.

The variables are as follows:

Variable Label    Description
PLURALITY         Number of children born of the pregnancy
SEX               Sex of child (1 = male; 2 = female)
MAGE              Age of mother (years)
WEEKS             Completed weeks of gestation (weeks)
MARITAL           Marital status (1 = married; 2 = not married)
RACEMOM           Race of mother (0 = other non-White; 1 = White; 2 = Black; 3 = American Indian; 4 = Chinese; 5 = Japanese; 6 = Hawaiian; 7 = Filipino; 8 = Other Asian or Pacific Islander)
HISPMOM           Mother of Hispanic origin (C = Cuban; M = Mexican; N = Non-Hispanic; O = other and unknown Hispanic; P = Puerto Rican; S = Central/South American; U = not classifiable)
GAINED            Weight gained during pregnancy (pounds)
SMOKE             0 = mother did not smoke during pregnancy; 1 = mother did smoke during pregnancy
DRINK             0 = mother did not consume alcohol during pregnancy; 1 = mother did consume alcohol during pregnancy
TOUNCES           Weight of child (ounces)
TGRAMS            Weight of child (grams)
LOW               0 = infant was not low birth weight; 1 = infant was low birth weight
PREMIE            0 = infant was not premature; 1 = infant was premature (premature defined as 36 weeks or sooner)

For the variables of MAGE, WEEKS, GAINED, TOUNCES, and TGRAMS:

1. Calculate the mean, median, standard deviation, IQR, and range.

2. For each, construct a histogram and comment on the shape of the distribution.

3. Do the histograms for TOUNCES and TGRAMS look strikingly similar? Why?

4. Construct box-and-whisker plots for all five variables.

5. Construct side-by-side box-and-whisker plots for the variable of TOUNCES for women who

admitted to smoking and women who did not admit to smoking. Do you see a difference in birth

weight in the two groups? Which group has more variability?

6. Construct side-by-side box-and-whisker plots for the variable of MAGE for women who are and are

not married. Do you see a difference in ages in the two groups? Which group has more variability?

Are the results surprising?

7. Calculate the skewness and kurtosis of the data set. What do they indicate?

REFERENCES

Methodology References

1. H. A. STURGES, “The Choice of a Class Interval,” Journal of the American Statistical Association, 21 (1926),

65–66.

2. HELEN M. WALKER, “Degrees of Freedom,” Journal of Educational Psychology, 31 (1940), 253–269.

3. ROB J. HYNDMAN and YANAN FAN, “Sample Quantiles in Statistical Packages,” The American Statistician, 50

(1996), 361–365.

4. JOHN W. TUKEY, Exploratory Data Analysis, Addison-Wesley, Reading, MA, 1977.

Applications References

A-1. FARHAD ATASSI, “Oral Home Care and the Reasons for Seeking Dental Care by Individuals on Renal Dialysis,”

Journal of Contemporary Dental Practice, 3 (2002), 031–041.

A-2. VALLABH JANARDHAN, ROBERT FRIEDLANDER, HOWARD RIINA, and PHILIP EDWIN STIEG, “Identifying Patients at Risk for

Postprocedural Morbidity after Treatment of Incidental Intracranial Aneurysms: The Role of Aneurysm Size and

Location,” Neurosurgical Focus, 13 (2002), 1–8.

A-3. A. HOEKEMA, B. HOVINGA, B. STEGENGA, and L. G. M. De BONT, “Craniofacial Morphology and Obstructive Sleep

Apnoea: A Cephalometric Analysis,” Journal of Oral Rehabilitation, 30 (2003), 690–696.

A-4. DAVID H. HOLBEN, “Selenium Content of Venison, Squirrel, and Beef Purchased or Produced in Ohio, a Low

Selenium Region of the United States,” Journal of Food Science, 67 (2002), 431–433.

A-5. ERIK SKJELBO, THEONEST K. MUTABINGWA, IB BYGBJERG, KARIN K. NIELSEN, LARS F. GRAM, and KIM BRØSEN,

“Chloroguanide Metabolism in Relation to the Efficacy in Malaria Prophylaxis and the S-Mephenytoin Oxidation

in Tanzanians,” Clinical Pharmacology & Therapeutics, 59 (1996), 304–311.

A-6. HENRIK SCHMIDT, POUL ERIK MORTENSEN, SØREN LARS FØLSGAARD, and ESTHER A. JENSEN, “Autotransfusion after Coronary Artery Bypass Grafting Halves the Number of Patients Needing Blood Transfusion,” Annals of Thoracic Surgery, 61 (1996), 1178–1181.

A-7. RICHARD EVANS, WANDA GORDON, and MIKE CONZEMIUS, “Effect of Velocity on Ground Reaction Forces in Dogs

with Lameness Attributable to Tearing of the Cranial Cruciate Ligament,” American Journal of Veterinary

Research, 64 (2003), 1479–1481.

A-8. SIMONA PORCELLINI, GUILIANA VALLANTI, SILVIA NOZZA, GUIDO POLI, ADRIANO LAZZARIN, GUISEPPE TAMBUSSI, and

ANTONIO GRASSIA, “Improved Thymopoietic Potential in Aviremic HIV Infected Individuals with HAART by

Intermittent IL-2 Administration,” AIDS, 17 (2003) 1621–1630.

A-9. HARRY N. SHAIR and ANNA JASPER, “Decreased Venous Return is Neither Sufficient nor Necessary to Elicit

Ultrasonic Vocalization of Infant Rat Pups,” Behavioral Neuroscience, 117 (2003), 840–853.

A-10. M. BUTZ, K. H. WOLLINSKY, U. WIDEMUTH-CATRINESCU, A. SPERFELD, S. WINTER, H. H. MEHRKENS, A. C. LUDOLPH, and H. SCHREIBER, “Longitudinal Effects of Noninvasive Positive-Pressure Ventilation in Patients with Amyotrophic Lateral Sclerosis,” American Journal of Medical Rehabilitation, 82 (2003), 597–604.

A-11. DAVID W. STARCH, JERRY W. ALEXANDER, PHILIP C. NOBLE, SURAJ REDDY, and DAVID M. LINTNER, “Multistranded

Hamstring Tendon Graft Fixation with a Central Four-Quadrant or a Standard Tibial Interference Screw for

Anterior Cruciate Ligament Reconstruction,” American Journal of Sports Medicine, 31 (2003), 338–344.

A-12. RICHARD J. CARDOSI, ROSEMARY CARDOSI, EDWARD C. GRENDYS Jr., JAMES V. FIORICA, and MITCHEL S. HOFFMAN,

“Infectious Urinary Tract Morbidity with Prolonged Bladder Catheterization after Radical Hysterectomy,”

American Journal of Obstetrics and Gynecology, 189 (2003), 380–384.

A-13. SATOSHI NOZAWA, KATSUJI SHIMIZU, KEI MIYAMOTO, and MIZUO TANAKA, “Repair of Pars Interarticularis Defect by

Segmental Wire Fixation in Young Athletes with Spondylolysis,” American Journal of Sports Medicine, 31

(2003), 359–364.

A-14. GEA A. HUIZINGA, WINETTE T. A. van der GRAAF, ANNEMIKE VISSER, JOS S. DIJKSTRA, and JOSETTE E. H. M.

HOEKSTRA-WEEBERS, “Psychosocial Consequences for Children of a Parent with Cancer,” Cancer Nursing, 26

(2003), 195–202.

A-15. ERIC W. TAM, ARTHUR F. MAK, WAI NGA LAM, JOHN H. EVANS, and YORK Y. CHOW, “Pelvic Movement and Interface Pressure Distribution During Manual Wheelchair Propulsion,” Archives of Physical Medicine and Rehabilitation, 84 (2003), 1466–1472.

A-16. MARK P. JOHNSON, LESLIE N. SUTTON, NATALIE RINTOUL, TIMOTHY M. CROMBLEHOLME, ALAN W. FLAKE, LORI

J. HOWELL, HOLLY L. HEDRICK, R. DOUGLAS WILSON, and N. SCOTT ADZICK, “Fetal Myelomeningocele

Repair: Short-term Clinical Outcomes,” American Journal of Obstetrics and Gynecology, 189 (2003),

482–487.

A-17. D. M. Z. KRIESER, A. R. ROSENBERG, G. KAINER, and D. NAIDOO, “The Relationship between Serum Creatinine,

Serum Cystatin C, and Glomerular Filtration Rate in Pediatric Renal Transplant Recipients: A Pilot Study,”

Pediatric Transplantation, 6 (2002), 392–395.

A-18. JORDAN G. SPIVACK, STEPHEN C. EPPES, and JOEL D. KLIEN, “Clostridium Difficile—Associated Diarrhea in a

Pediatric Hospital,” Clinical Pediatrics, 42 (2003), 347–352.

A-19. N. THILOTHAMMAL, P. V. KRISHNAMURTHY, DESMOND K. RUNYAN, and K. BANU, “Does BCG Vaccine Prevent

Tuberculous Meningitis?” Archives of Disease in Childhood, 74 (1996), 144–147.

A-20. North Carolina State Center for Health Statistics and Howard W. Odum Institute for Research in Social Science

at the University of North Carolina at Chapel Hill. Birth data set for 2001 found at www.irss.unc.edu/ncvital/

bfd1down.html. All calculations were performed by John Holcomb and do not represent the findings of the

Center or Institute.

CHAPTER 3

SOME BASIC PROBABILITY CONCEPTS

CHAPTER OVERVIEW

Probability lays the foundation for statistical inference. This chapter provides a brief overview of the probability concepts necessary for understanding topics covered in the chapters that follow. It also provides a context for understanding the probability distributions used in statistical inference, and introduces the student to several measures commonly found in the medical literature (e.g., the sensitivity and specificity of a test).

TOPICS

3.1 INTRODUCTION

3.2 TWO VIEWS OF PROBABILITY: OBJECTIVE AND SUBJECTIVE

3.3 ELEMENTARY PROPERTIES OF PROBABILITY

3.4 CALCULATING THE PROBABILITY OF AN EVENT

3.5 BAYES’ THEOREM, SCREENING TESTS, SENSITIVITY, SPECIFICITY,

AND PREDICTIVE VALUE POSITIVE AND NEGATIVE

3.6 SUMMARY

LEARNING OUTCOMES

After studying this chapter, the student will

1. understand classical, relative frequency, and subjective probability.

2. understand the properties of probability and selected probability rules.

3. be able to calculate the probability of an event.

4. be able to apply Bayes’ theorem when calculating screening test results.

3.1 INTRODUCTION

The theory of probability provides the foundation for statistical inference. However, this

theory, which is a branch of mathematics, is not the main concern of this book, and,

consequently, only its fundamental concepts are discussed here. Students who desire to

pursue this subject should refer to the many books on probability available in most college

and university libraries. The books by Gut (1), Isaac (2), and Larson (3) are recommended.

The objectives of this chapter are to help students gain some mathematical ability in the area of probability and to assist them in developing an understanding of the more important concepts. Progress along these lines will contribute immensely to their success in understanding the statistical inference procedures presented later in this book.

The concept of probability is not foreign to health workers and is frequently

encountered in everyday communication. For example, we may hear a physician say

that a patient has a 50–50 chance of surviving a certain operation. Another physician may

say that she is 95 percent certain that a patient has a particular disease. A public health

nurse may say that nine times out of ten a certain client will break an appointment. As these

examples suggest, most people express probabilities in terms of percentages. In dealing

with probabilities mathematically, it is more convenient to express probabilities as

fractions. (Percentages result from multiplying the fractions by 100.) Thus, we measure

the probability of the occurrence of some event by a number between zero and one. The

more likely the event, the closer the number is to one; and the more unlikely the event, the

closer the number is to zero. An event that cannot occur has a probability of zero, and an

event that is certain to occur has a probability of one.

Health sciences researchers continually ask themselves if the results of their efforts

could have occurred by chance alone or if some other force was operating to produce the

observed effects. For example, suppose six out of ten patients suffering from some disease

are cured after receiving a certain treatment. Is such a cure rate likely to have occurred if

the patients had not received the treatment, or is it evidence of a true curative effect on the

part of the treatment? We shall see that questions such as these can be answered through the

application of the concepts and laws of probability.

3.2 TWO VIEWS OF PROBABILITY: OBJECTIVE AND SUBJECTIVE

Until fairly recently, probability was thought of by statisticians and mathematicians only as

an objective phenomenon derived from objective processes.

The concept of objective probability may be categorized further under the headings

of (1) classical, or a priori, probability, and (2) the relative frequency, or a posteriori,

concept of probability.

Classical Probability The classical treatment of probability dates back to the

17th century and the work of two mathematicians, Pascal and Fermat. Much of this theory

developed out of attempts to solve problems related to games of chance, such as those

involving the rolling of dice. Examples from games of chance illustrate very well the

principles involved in classical probability. For example, if a fair six-sided die is rolled, the

probability that a 1 will be observed is equal to 1/6 and is the same for the other five faces.

If a card is picked at random from a well-shuffled deck of ordinary playing cards, the

probability of picking a heart is 13/52. Probabilities such as these are calculated by the

processes of abstract reasoning. It is not necessary to roll a die or draw a card to compute

these probabilities. In the rolling of the die, we say that each of the six sides is equally likely

to be observed if there is no reason to favor any one of the six sides. Similarly, if there is no

reason to favor the drawing of a particular card from a deck of cards, we say that each of the

52 cards is equally likely to be drawn. We may define probability in the classical sense

as follows:

DEFINITION

If an event can occur in N mutually exclusive and equally likely ways,

and if m of these possess a trait E, the probability of the occurrence of E

is equal to $m/N$.

If we read $P(E)$ as “the probability of E,” we may express this definition as

$$P(E) = \frac{m}{N} \tag{3.2.1}$$
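
The classical definition can be sketched in a few lines of Python. This is only an illustration: the function name `classical_probability` is hypothetical, and exact fractions are used so that the results match the die and playing-card figures quoted in the text.

```python
from fractions import Fraction


def classical_probability(favorable: int, total: int) -> Fraction:
    """Return m/N for N equally likely, mutually exclusive outcomes,
    m of which possess the trait of interest."""
    if total <= 0 or not 0 <= favorable <= total:
        raise ValueError("need total > 0 and 0 <= favorable <= total")
    return Fraction(favorable, total)


# A fair six-sided die: one of the six faces shows a 1.
p_one = classical_probability(1, 6)

# An ordinary deck: 13 of the 52 cards are hearts.
p_heart = classical_probability(13, 52)

print(p_one, float(p_one))
print(p_heart, float(p_heart))
```

No die is rolled and no card is drawn here; the probabilities follow from counting alone, which is the sense in which they are called a priori.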

Relative Frequency Probability The relative frequency approach to probability

depends on the repeatability of some process and the ability to count the number

of repetitions, as well as the number of times that some event of interest occurs. In this

context we may define the probability of observing some characteristic, E, of an event

as follows:

DEFINITION

If some process is repeated a large number of times, n, and if some

resulting event with the characteristic E occurs m times, the relative

frequency of occurrence of E, $m/n$, will be approximately equal to the

probability of E.

To express this definition in compact form, we write

$$P(E) = \frac{m}{n} \tag{3.2.2}$$

We must keep in mind, however, that, strictly speaking, $m/n$ is only an estimate of $P(E)$.
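
A small simulation may help show why $m/n$ is only an estimate. The sketch below assumes a simulated fair die (Python's pseudorandom generator stands in for the physical process), and the function name `relative_frequency` and the seed are arbitrary choices made for illustration.

```python
import random


def relative_frequency(n_rolls: int, face: int = 1, seed: int = 0) -> float:
    """Roll a simulated fair die n_rolls times and return m/n,
    where m is the number of rolls that show `face`."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    m = sum(1 for _ in range(n_rolls) if rng.randint(1, 6) == face)
    return m / n_rolls


# The estimate m/n tends to settle near 1/6 as n grows.
for n in (60, 600, 60_000):
    print(n, round(relative_frequency(n), 4))
```

With few repetitions the relative frequency can be noticeably off; with many it is usually close to the classical value of 1/6, although it need not equal it exactly.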

Subjective Probability In the early 1950s, L. J. Savage (4) gave considerable

impetus to what is called the “personalistic” or subjective concept of probability. This view

holds that probability measures the confidence that a particular individual has in the truth of

a particular proposition. This concept does not rely on the repeatability of any process. In

fact, by applying this concept of probability, one may evaluate the probability of an event

that can only happen once, for example, the probability that a cure for cancer will be

discovered within the next 10 years.

Although the subjective view of probability has enjoyed increased attention over the

years, it has not been fully accepted by statisticians who have traditional orientations.

Bayesian Methods Bayesian methods are named in honor of the Reverend

Thomas Bayes (1702–1761), an English clergyman who had an interest in mathematics.

Bayesian methods are an example of subjective probability, since they take into consideration

the degree of belief that one has in the chance that an event will occur. While

probabilities based on classical or relative frequency concepts are designed to allow for

decisions to be made solely on the basis of collected data, Bayesian methods make use of

what are known as prior probabilities and posterior probabilities.

DEFINITION

The prior probability of an event is a probability based on prior

knowledge, prior experience, or results derived from prior

data collection activity.

DEFINITION

The posterior probability of an event is a probability obtained by using

new information to update or revise a prior probability.

As more data are gathered, more is likely to be known about the “true” probability of the event under consideration. Although the idea of updating probabilities based on new information is in direct contrast to the philosophy behind frequency-of-occurrence probability, Bayesian concepts are widely used. For example, Bayesian techniques have found

recent application in the construction of e-mail spam filters. Typically, the application of

Bayesian concepts makes use of a mathematical formula called Bayes’ theorem. In Section

3.5 we employ Bayes’ theorem in the evaluation of diagnostic screening test data.
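
The step from a prior to a posterior probability can be previewed with a short sketch. Everything numeric here is hypothetical and chosen only for easy arithmetic: a prior probability of 1/10 that an event is true, and new information that is observed with probability 9/10 when the event is true and 1/5 when it is not. The function name `posterior` is likewise an assumption, and the formula is the standard two-case form of Bayes' theorem.

```python
from fractions import Fraction


def posterior(prior, p_info_if_true, p_info_if_false):
    """Revise a prior probability after observing new information,
    using the two-case form of Bayes' theorem."""
    numerator = p_info_if_true * prior
    evidence = numerator + p_info_if_false * (1 - prior)
    return numerator / evidence


prior = Fraction(1, 10)  # hypothetical prior probability
updated = posterior(prior, Fraction(9, 10), Fraction(1, 5))

print(prior, "->", updated)
```

Under these made-up inputs the probability rises from 1/10 to 1/3. The posterior from one round of updating could serve as the prior for the next as further data arrive.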

3.3 ELEMENTARY PROPERTIES OF PROBABILITY

In 1933 the axiomatic approach to probability was formalized by the Russian mathematician

A. N. Kolmogorov (5). The basis of this approach is embodied in three properties from

which a whole system of probability theory is constructed through the use of mathematical

logic. The three properties are as follows.

1. Given some process (or experiment) with n mutually exclusive outcomes (called events), $E_1, E_2, \ldots, E_n$, the probability of any event $E_i$ is assigned a nonnegative number. That is,

$$P(E_i) \ge 0 \tag{3.3.1}$$

In other words, all events must have a probability greater than or equal to zero,

a reasonable requirement in view of the difficulty of conceiving of negative probability.

A key concept in the statement of this property is the concept of mutually

exclusive outcomes. Two events are said to be mutually exclusive if they cannot occur

simultaneously.

2. The sum of the probabilities of the mutually exclusive outcomes is equal to 1.

$$P(E_1) + P(E_2) + \cdots + P(E_n) = 1 \tag{3.3.2}$$

This is the property of exhaustiveness and refers to the fact that the observer of

a probabilistic process must allow for all possible events, and when all are taken

together, their total probability is 1. The requirement that the events be mutually exclusive specifies that the events $E_1, E_2, \ldots, E_n$ do not overlap; that is, no two of them can occur at the same time.

3. Consider any two mutually exclusive events, $E_i$ and $E_j$. The probability of the occurrence of either $E_i$ or $E_j$ is equal to the sum of their individual probabilities.

$$P(E_i + E_j) = P(E_i) + P(E_j) \tag{3.3.3}$$

Suppose the two events were not mutually exclusive; that is, suppose they could occur at the same time. In attempting to compute the probability of the occurrence of either $E_i$ or $E_j$, the problem of overlapping would be discovered, and the procedure could become quite complicated. This concept will be discussed further in the next section.
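
The three properties, and the overlap problem just mentioned, can be checked on a small case. The sketch below assumes a fair six-sided die as the process; the events "even" and "high" (a roll of 4 or more) are hypothetical choices made because they share outcomes.

```python
from fractions import Fraction

# Six mutually exclusive, equally likely outcomes of one roll of a fair die.
die = {face: Fraction(1, 6) for face in range(1, 7)}

# Property 1: every probability is nonnegative.
nonnegative = all(p >= 0 for p in die.values())

# Property 2: the probabilities of all the outcomes sum to 1.
sums_to_one = sum(die.values()) == 1

# Property 3: for mutually exclusive events the probabilities add.
p_one_or_two = die[1] + die[2]


def prob(event):
    """Probability of an event given as a set of die faces."""
    return sum(die[face] for face in event)


# Two events that are NOT mutually exclusive: they share 4 and 6.
even = {2, 4, 6}
high = {4, 5, 6}
naive_sum = prob(even) + prob(high)  # counts faces 4 and 6 twice
p_either = prob(even | high)         # counts each face once

print(nonnegative, sums_to_one, p_one_or_two)
print(naive_sum, p_either)
```

Adding the probabilities of the overlapping events gives 1, while counting each outcome once gives 2/3, so simple addition overstates the probability when events can occur together.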

3.4 CALCULATING THE PROBABILITY OF AN EVENT

We now make use of the concepts and techniques of the previous sections in calculating the

probabilities of specific events. Additional ideas will be introduced as needed.

EXAMPLE 3.4.1

The primary aim of a study by Carter et al. (A-1) was to investigate the effect of the age at onset of bipolar disorder on the course of the illness. One of the variables investigated was family history of mood disorders. Table 3.4.1 shows the frequency of a family history of mood disorders in the two groups of interest (Early age at onset defined to be 18 years or younger and Later age at onset defined to be later than 18 years). Suppose we pick a person at random from this sample. What is the probability that this person's age at onset was 18 years or younger (Early)?

TABLE 3.4.1 Frequency of Family History of Mood Disorder by Age Group among Bipolar Subjects

Family History of Mood Disorders    Early ≤ 18 (E)    Later > 18 (L)    Total
Negative (A)                        28                35                63
Bipolar disorder (B)                19                38                57
Unipolar (C)                        41                44                85
Unipolar and bipolar (D)            53                60                113
Total                               141               177               318

Source: Tasha D. Carter, Emanuela Mundo, Sagar V. Parkh, and James L. Kennedy, “Early Age at Onset as a Risk Factor for Poor Outcome of Bipolar Disorder,” Journal of Psychiatric Research, 37 (2003), 297–303.

Solution: For purposes of illustrating the calculation of probabilities we consider this

group of 318 subjects to be the largest group for which we have an interest. In

other words, for this example, we consider the 318 subjects as a population.

We assume that Early and Later are mutually exclusive categories and that the

likelihood of selecting any one person is equal to the likelihood of selecting

any other person. We define the desired probability as the number of subjects

with the characteristic of interest (Early) divided by the total number of

subjects. We may write the result in probability notation as follows:

$$P(E) = \frac{\text{number of Early subjects}}{\text{total number of subjects}} = \frac{141}{318} = .4434$$ ■
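
The same marginal probability can be reproduced from the cell counts of Table 3.4.1. In this sketch the dictionary layout and variable names are illustrative choices; only the counts come from the table.

```python
# Cell counts from Table 3.4.1: rows are family history (A-D),
# columns are age at onset (E = Early, L = Later).
table = {
    "A": {"E": 28, "L": 35},
    "B": {"E": 19, "L": 38},
    "C": {"E": 41, "L": 44},
    "D": {"E": 53, "L": 60},
}

# Marginal total for the Early column and the grand total.
n_early = sum(row["E"] for row in table.values())
n_total = sum(sum(row.values()) for row in table.values())

# Unconditional (marginal) probability of Early.
p_early = n_early / n_total

print(n_early, n_total, round(p_early, 4))  # 141 318 0.4434
```

Because every subject is equally likely to be selected, this is the classical ratio $m/N$ with $m = 141$ and $N = 318$.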

Conditional Probability On occasion, the set of “all possible outcomes” may

constitute a subset of the total group. In other words, the size of the group of interest may be

reduced by conditions not applicable to the total group. When probabilities are calculated

with a subset of the total group as the denominator, the result is a conditional probability.

The probability computed in Example 3.4.1, for example, may be thought of as an

unconditional probability, since the size of the total group served as the denominator. No

conditions were imposed to restrict the size of the denominator. We may also think of this

probability as a marginal probability since one of the marginal totals was used as the

numerator.

We may illustrate the concept of conditional probability by referring again to

Table 3.4.1.

EXAMPLE 3.4.2

Suppose we pick a subject at random from the 318 subjects and find that he is 18 years or

younger (E). What is the probability that this subject will be one who has no family history

of mood disorders (A)?

Solution: The total number of subjects is no longer of interest, since, with the selection

of an Early subject, the Later subjects are eliminated. We may define the

desired probability, then, as follows: What is the probability that a subject has

no family history of mood disorders (A), given that the selected subject is

Early (E)? This is a conditional probability and is written as $P(A \mid E)$, in which

the vertical line is read “given.” The 141 Early subjects become the

denominator of this conditional probability, and 28, the number of Early

subjects with no family history of mood disorders, becomes the numerator.

Our desired probability, then, is

$$P(A \mid E) = \frac{28}{141} = .1986$$
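The contrast between the unconditional and the conditional calculation can be sketched in a few lines of Python. This is only an illustration: the variable names are our own, and the counts are the ones quoted in Examples 3.4.1 and 3.4.2.

```python
from fractions import Fraction

# Counts quoted in Examples 3.4.1 and 3.4.2 (from Table 3.4.1).
total_subjects = 318        # all subjects
early_total = 141           # Early age at onset (E)
early_and_no_history = 28   # Early (E) with no family history (A)

# Unconditional (marginal) probability: the whole group is the denominator.
p_e = Fraction(early_total, total_subjects)

# Conditional probability: only the Early subjects form the denominator.
p_a_given_e = Fraction(early_and_no_history, early_total)

print(round(float(p_e), 4))          # 0.4434
print(round(float(p_a_given_e), 4))  # 0.1986
```

Only the denominator changes between the two calculations, which is the point of the definition.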


Joint Probability Sometimes we want to find the probability that a subject picked

at random from a group of subjects possesses two characteristics at the same time. Such a

probability is referred to as a joint probability. We illustrate the calculation of a joint

probability with the following example.

EXAMPLE 3.4.3

Let us refer again to Table 3.4.1. What is the probability that a person picked at random

from the 318 subjects will be Early (E) and will be a person who has no family history of

mood disorders (A)?

Solution: The probability we are seeking may be written in symbolic notation as

$P(E \cap A)$, in which the symbol $\cap$ is read either as “intersection” or “and.” The
statement $E \cap A$ indicates the joint occurrence of conditions E and A. The

number of subjects satisfying both of the desired conditions is found in

Table 3.4.1 at the intersection of the column labeled E and the row labeled A

and is seen to be 28. Since the selection will be made from the total set of

subjects, the denominator is 318. Thus, we may write the joint probability as

$$P(E \cap A) = \frac{28}{318} = .0881$$

The Multiplication Rule A probability may be computed from other probabilities.
For example, a joint probability may be computed as the product of an appropriate

marginal probability and an appropriate conditional probability. This relationship is known

as the multiplication rule of probability. We illustrate with the following example.

EXAMPLE 3.4.4

We wish to compute the joint probability of Early age at onset (E) and a negative family

history of mood disorders (A) from a knowledge of an appropriate marginal probability and

an appropriate conditional probability.

Solution: The probability we seek is $P(E \cap A)$. We have already computed a marginal
probability, $P(E) = 141/318 = .4434$, and a conditional probability,
$P(A \mid E) = 28/141 = .1986$. It so happens that these are appropriate marginal
and conditional probabilities for computing the desired joint probability. We
may now compute $P(E \cap A) = P(E)P(A \mid E) = (.4434)(.1986) = .0881$.
This, we note, is, as expected, the same result we obtained earlier for $P(E \cap A)$.
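As a quick check of the multiplication rule, the following sketch repeats the calculation with exact fractions, so that no rounding enters until the final step. The variable names are illustrative.

```python
from fractions import Fraction

p_e = Fraction(141, 318)         # marginal probability P(E)
p_a_given_e = Fraction(28, 141)  # conditional probability P(A | E)

# Multiplication rule: P(E and A) = P(E) * P(A | E)
p_e_and_a = p_e * p_a_given_e

# The 141 cancels, leaving the direct count: 28 of the 318 subjects.
assert p_e_and_a == Fraction(28, 318)
print(round(float(p_e_and_a), 4))  # 0.0881
```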

We may state the multiplication rule in general terms as follows: For any two events

A and B,

$$P(A \cap B) = P(B)P(A \mid B), \quad \text{if } P(B) \neq 0 \tag{3.4.1}$$

For the same two events A and B, the multiplication rule may also be written as

$$P(A \cap B) = P(A)P(B \mid A), \quad \text{if } P(A) \neq 0$$

We see that through algebraic manipulation the multiplication rule as stated in

Equation 3.4.1 may be used to find any one of the three probabilities in its statement if the

other two are known. We may, for example, find the conditional probability $P(A \mid B)$ by
dividing $P(A \cap B)$ by $P(B)$. This relationship allows us to formally define conditional

probability as follows.

DEFINITION

The conditional probability of A given B is equal to the probability of

$A \cap B$ divided by the probability of B, provided the probability of B
is not zero.

That is,

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \quad P(B) \neq 0 \tag{3.4.2}$$

We illustrate the use of the multiplication rule to compute a conditional probability with the

following example.

EXAMPLE 3.4.5

We wish to use Equation 3.4.2 and the data in Table 3.4.1 to find the conditional probability
$P(A \mid E)$.

Solution: According to Equation 3.4.2,

$$P(A \mid E) = \frac{P(A \cap E)}{P(E)}$$

Earlier we found $P(E \cap A) = P(A \cap E) = 28/318 = .0881$. We have also determined that
$P(E) = 141/318 = .4434$. Using these results we are able to compute
$P(A \mid E) = .0881/.4434 = .1987$, which, as expected, is the same result we obtained by using the
frequencies directly from Table 3.4.1. (The slight discrepancy is due to rounding.)
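The rounding discrepancy can be made visible with a small sketch of Equation 3.4.2. The helper name `conditional` is our own. The calculation is done once with exact fractions and once with the four-decimal values used in the text.

```python
from fractions import Fraction

def conditional(p_joint, p_given):
    """Return P(A | B) = P(A and B) / P(B); undefined when P(B) is zero."""
    if p_given == 0:
        raise ValueError("P(B) must be nonzero")
    return p_joint / p_given

exact = conditional(Fraction(28, 318), Fraction(141, 318))
rounded_inputs = conditional(0.0881, 0.4434)

print(exact)                     # 28/141
print(round(float(exact), 4))    # 0.1986
print(round(rounded_inputs, 4))  # 0.1987
```

Working from the counts gives .1986. Dividing the already-rounded probabilities gives .1987.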

The Addition Rule The third property of probability given previously states that

the probability of the occurrence of either one or the other of two mutually exclusive events

is equal to the sum of their individual probabilities. Suppose, for example, that we pick a

person at random from the 318 represented in Table 3.4.1. What is the probability that this

person will be Early age at onset (E) or Later age at onset (L)? We state this probability
in symbols as $P(E \cup L)$, where the symbol $\cup$ is read either as “union” or “or.” Since the
two age conditions are mutually exclusive, $P(E \cup L) = (141/318) + (177/318) =
.4434 + .5566 = 1$.

What if two events are not mutually exclusive? This case is covered by what is known

as the addition rule, which may be stated as follows:

DEFINITION

Given two events A and B, the probability that event A, or event B, or

both occur is equal to the probability that event A occurs, plus the

probability that event B occurs, minus the probability that the events

occur simultaneously.


The addition rule may be written

$$P(A \cup B) = P(A) + P(B) - P(A \cap B) \tag{3.4.3}$$

When events A and B cannot occur simultaneously, $P(A \cup B)$ is sometimes called
“exclusive or,” and $P(A \cap B) = 0$. When events A and B can occur simultaneously,
$P(A \cup B)$ is sometimes called “inclusive or,” and we use the addition rule to calculate
$P(A \cup B)$. Let us illustrate the use of the addition rule by means of an example.

EXAMPLE 3.4.6

If we select a person at random from the 318 subjects represented in Table 3.4.1, what is the

probability that this person will be an Early age of onset subject (E) or will have no family

history of mood disorders (A) or both?

Solution: The probability we seek is $P(E \cup A)$. By the addition rule as expressed
by Equation 3.4.3, this probability may be written as $P(E \cup A) =
P(E) + P(A) - P(E \cap A)$. We have already found that $P(E) = 141/318 = .4434$
and $P(E \cap A) = 28/318 = .0881$. From the information in Table 3.4.1
we calculate $P(A) = 63/318 = .1981$. Substituting these results into the
equation for $P(E \cup A)$ we have $P(E \cup A) = .4434 + .1981 - .0881 = .5534$.

Note that the 28 subjects who are both Early and have no family history of mood disorders

are included in the 141 who are Early as well as in the 63 who have no family history of

mood disorders. Since, in computing the probability, these 28 have been added into the

numerator twice, they have to be subtracted out once to overcome the effect of duplication,

or overlapping.
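The correction for double counting can be sketched with the counts themselves. The variable names are illustrative. Note that the exact fraction, 176/318, rounds to .5535, whereas adding and subtracting the already-rounded four-decimal probabilities gives .5534, so the two figures differ only through rounding.

```python
from fractions import Fraction

n = 318
early = 141       # subjects in E
no_history = 63   # subjects in A
both = 28         # subjects in both E and A (counted in each total above)

# Addition rule: P(E or A) = P(E) + P(A) - P(E and A)
p_union = Fraction(early, n) + Fraction(no_history, n) - Fraction(both, n)

# Equivalent head count: 141 + 63 - 28 = 176 distinct subjects.
assert p_union == Fraction(176, 318)
print(round(float(p_union), 4))  # 0.5535
```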

Independent Events Suppose that, in Equation 3.4.2, we are told that event B has

occurred, but that this fact has no effect on the probability of A. That is, suppose that the

probability of event A is the same regardless of whether or not B occurs. In this situation,

$P(A \mid B) = P(A)$. In such cases we say that A and B are independent events. The
multiplication rule for two independent events, then, may be written as

$$P(A \cap B) = P(A)P(B), \quad P(A) \neq 0, \; P(B) \neq 0 \tag{3.4.4}$$

Thus, we see that if two events are independent, the probability of their joint

occurrence is equal to the product of the probabilities of their individual occurrences.

Note that when two events with nonzero probabilities are independent, each of the

following statements is true:

$$P(A \mid B) = P(A), \quad P(B \mid A) = P(B), \quad P(A \cap B) = P(A)P(B)$$

Two events are not independent unless all these statements are true. It is important to be

aware that the terms independent and mutually exclusive do not mean the same thing.

Let us illustrate the concept of independence by means of the following example.


EXAMPLE 3.4.7

In a certain high school class, consisting of 60 girls and 40 boys, it is observed that 24 girls

and 16 boys wear eyeglasses. If a student is picked at random from this class, the

probability that the student wears eyeglasses, $P(E)$, is 40/100, or .4.

(a) What is the probability that a student picked at random wears eyeglasses, given that

the student is a boy?

Solution: By using the formula for computing a conditional probability, we find this

to be

$$P(E \mid B) = \frac{P(E \cap B)}{P(B)} = \frac{16/100}{40/100} = .4$$

Thus the additional information that a student is a boy does not alter the
probability that the student wears eyeglasses, and $P(E) = P(E \mid B)$. We say
that the events being a boy and wearing eyeglasses for this group are
independent. We may also show that the event of wearing eyeglasses, E,
and not being a boy, $\bar{B}$, are also independent as follows:

$$P(E \mid \bar{B}) = \frac{P(E \cap \bar{B})}{P(\bar{B})} = \frac{24/100}{60/100} = \frac{24}{60} = .4$$

(b) What is the probability of the joint occurrence of the events of wearing eyeglasses

and being a boy?

Solution: Using the rule given in Equation 3.4.1, we have

$$P(E \cap B) = P(B)P(E \mid B)$$

but, since we have shown that events E and B are independent, we may replace
$P(E \mid B)$ by $P(E)$ to obtain, by Equation 3.4.4,

$$P(E \cap B) = P(B)P(E) = \left(\frac{40}{100}\right)\left(\frac{40}{100}\right) = .16$$
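The three equivalent conditions for independence can be checked directly from the class counts. The following is a minimal sketch with names of our own choosing. Exact fractions are used so that the equality tests are not affected by floating-point rounding.

```python
from fractions import Fraction

# Class counts from Example 3.4.7.
girls, boys = 60, 40
girls_with_glasses, boys_with_glasses = 24, 16
n = girls + boys

p_e = Fraction(girls_with_glasses + boys_with_glasses, n)  # P(E)
p_b = Fraction(boys, n)                                    # P(B)
p_e_and_b = Fraction(boys_with_glasses, n)                 # P(E and B)

p_e_given_b = p_e_and_b / p_b   # P(E | B)
p_b_given_e = p_e_and_b / p_e   # P(B | E)

# All three statements must hold for E and B to be independent.
independent = (p_e_given_b == p_e
               and p_b_given_e == p_b
               and p_e_and_b == p_e * p_b)
print(independent)       # True
print(float(p_e_and_b))  # 0.16
```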

Complementary Events Earlier, using the data in Table 3.4.1, we computed the

probability that a person picked at random from the 318 subjects will be an Early age of

onset subject as $P(E) = 141/318 = .4434$. We found the probability of a Later age at onset
to be $P(L) = 177/318 = .5566$. The sum of these two probabilities we found to be equal
to 1. This is true because the events being Early age at onset and being Later age at onset are
complementary events. In general, we may make the following statement about
complementary events. The probability of an event A is equal to 1 minus the probability of its
complement, which is written $\bar{A}$, and

$$P(\bar{A}) = 1 - P(A) \tag{3.4.5}$$

This follows from the third property of probability since the event, A, and its
complement, $\bar{A}$, are mutually exclusive.

EXAMPLE 3.4.8

Suppose that of 1200 admissions to a general hospital during a certain period of time, 750

are private admissions. If we designate these as set A, then $\bar{A}$ is equal to 1200 minus 750, or
450. We may compute

$$P(A) = 750/1200 = .625$$

and

$$P(\bar{A}) = 450/1200 = .375$$

and see that

$$\begin{aligned}
P(\bar{A}) &= 1 - P(A) \\
.375 &= 1 - .625 \\
.375 &= .375
\end{aligned}$$
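The same check can be written as a short sketch. The variable names are illustrative.

```python
from fractions import Fraction

admissions = 1200
private = 750   # set A

p_a = Fraction(private, admissions)
p_not_a = Fraction(admissions - private, admissions)

assert p_not_a == 1 - p_a          # Equation 3.4.5
print(float(p_a), float(p_not_a))  # 0.625 0.375
```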

Marginal Probability Earlier we used the term marginal probability to refer

to a probability in which the numerator of the probability is a marginal total from a table

such as Table 3.4.1. For example, when we compute the probability that a person picked

at random from the 318 persons represented in Table 3.4.1 is an Early age of onset

subject, the numerator of the probability is the total number of Early subjects, 141. Thus,

$P(E) = 141/318 = .4434$. We may define marginal probability more generally as follows:

DEFINITION

Given some variable that can be broken down into m categories
designated by $A_1, A_2, \ldots, A_i, \ldots, A_m$ and another jointly occurring
variable that is broken down into n categories designated by
$B_1, B_2, \ldots, B_j, \ldots, B_n$, the marginal probability of $A_i$, $P(A_i)$, is equal to the
sum of the joint probabilities of $A_i$ with all the categories of B. That is,

$$P(A_i) = \sum_j P(A_i \cap B_j), \quad \text{for all values of } j \tag{3.4.6}$$

The following example illustrates the use of Equation 3.4.6 in the calculation of a marginal

probability.

EXAMPLE 3.4.9

We wish to use Equation 3.4.6 and the data in Table 3.4.1 to compute the marginal

probability P(E).


Solution: The variable age at onset is broken down into two categories, Early for onset

18 years or younger (E) and Later for onset occurring at an age over 18 years

(L). The variable family history of mood disorders is broken down into four

categories: negative family history (A), bipolar disorder only (B), unipolar

disorder only (C), and subjects with a history of both unipolar and bipolar

disorder (D). The category Early occurs jointly with all four categories of the

variable family history of mood disorders. The four joint probabilities that

may be computed are

$$\begin{aligned}
P(E \cap A) &= 28/318 = .0881 \\
P(E \cap B) &= 19/318 = .0597 \\
P(E \cap C) &= 41/318 = .1289 \\
P(E \cap D) &= 53/318 = .1667
\end{aligned}$$

We obtain the marginal probability $P(E)$ by adding these four joint probabilities
as follows:

$$\begin{aligned}
P(E) &= P(E \cap A) + P(E \cap B) + P(E \cap C) + P(E \cap D) \\
&= .0881 + .0597 + .1289 + .1667 \\
&= .4434
\end{aligned}$$

The result, as expected, is the same as the one obtained by using the marginal total for

Early as the numerator and the total number of subjects as the denominator.
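Equation 3.4.6 amounts to summing across one row or column of the table. A minimal sketch follows, using the four Early counts quoted in Example 3.4.9. The dictionary layout and names are our own.

```python
from fractions import Fraction

n = 318
# Early-onset counts by family-history category, as quoted in Example 3.4.9.
early_counts = {"A": 28, "B": 19, "C": 41, "D": 53}

# Joint probabilities P(E and category) for each family-history category.
joint = {cat: Fraction(count, n) for cat, count in early_counts.items()}

# Marginal probability as the sum of the joint probabilities (Equation 3.4.6).
p_e = sum(joint.values())

assert p_e == Fraction(141, 318)
for cat, p in joint.items():
    print(cat, round(float(p), 4))
print("P(E) =", round(float(p_e), 4))  # P(E) = 0.4434
```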

EXERCISES

3.4.1 In a study of violent victimization of women and men, Porcerelli et al. (A-2) collected information

from 679 women and 345 men aged 18 to 64 years at several family practice centers in the

metropolitan Detroit area. Patients filled out a health history questionnaire that included a question

about victimization. The following table shows the sample subjects cross-classified by sex and the

type of violent victimization reported. The victimization categories are defined as no victimization,

partner victimization (and not by others), victimization by persons other than partners (friends,

family members, or strangers), and those who reported multiple victimization.

         No Victimization   Partners   Nonpartners   Multiple Victimization   Total
Women          611             34          16                 18               679
Men            308             10          17                 10               345
Total          919             44          33                 28              1024

Source: Data provided courtesy of John H. Porcerelli, Ph.D., Rosemary Cogan, Ph.D.

(a) Suppose we pick a subject at random from this group. What is the probability that this subject

will be a woman?

(b) What do we call the probability calculated in part a?

(c) Show how to calculate the probability asked for in part a by two additional methods.


(d) If we pick a subject at random, what is the probability that the subject will be a woman and have

experienced partner abuse?

(e) What do we call the probability calculated in part d?

(f) Suppose we picked a man at random. Knowing this information, what is the probability that he

experienced abuse from nonpartners?

(g) What do we call the probability calculated in part f?

(h) Suppose we pick a subject at random. What is the probability that it is a man or someone who

experienced abuse from a partner?

(i) What do we call the method by which you obtained the probability in part h?

3.4.2 Fernando et al. (A-3) studied drug-sharing among injection drug users in the South Bronx in New

York City. Drug users in New York City use the term “split a bag” or “get down on a bag” to refer to

the practice of dividing a bag of heroin or other injectable substances. A common practice includes

splitting drugs after they are dissolved in a common cooker, a procedure with considerable HIV risk.

Although this practice is common, little is known about the prevalence of such practices. The

researchers asked injection drug users in four neighborhoods in the South Bronx if they ever

“got down on” drugs in bags or shots. The results classified by gender and splitting practice are

given below:

Gender    Split Drugs   Never Split Drugs   Total
Male          349              324            673
Female        220              128            348
Total         569              452           1021

Source: Daniel Fernando, Robert F. Schilling, Jorge Fontdevila,

and Nabila El-Bassel, “Predictors of Sharing Drugs among

Injection Drug Users in the South Bronx: Implications for HIV

Transmission,” Journal of Psychoactive Drugs, 35 (2003), 227–236.

(a) How many marginal probabilities can be calculated from these data? State each in probability

notation and do the calculations.

(b) How many joint probabilities can be calculated? State each in probability notation and do the

calculations.

(c) How many conditional probabilities can be calculated? State each in probability notation and do

the calculations.

(d) Use the multiplication rule to find the probability that a person picked at random never split

drugs and is female.

(e) What do we call the probability calculated in part d?

(f) Use the multiplication rule to find the probability that a person picked at random is male, given

that he admits to splitting drugs.

(g) What do we call the probability calculated in part f?

3.4.3 Refer to the data in Exercise 3.4.2. State the following probabilities in words and calculate:

(a) $P(\text{Male} \cap \text{Split Drugs})$

(b) $P(\text{Male} \cup \text{Split Drugs})$

(c) $P(\text{Male} \mid \text{Split Drugs})$

(d) P(Male)


3.4.4 Laveist and Nuru-Jeter (A-4) conducted a study to determine if doctor–patient race concordance was

associated with greater satisfaction with care. Toward that end, they collected a national sample of

African-American, Caucasian, Hispanic, and Asian-American respondents. The following table

classifies the race of the subjects as well as the race of their physician:

                                          Patient’s Race
Physician’s Race          Caucasian   African-American   Hispanic   Asian-American   Total
White                        779            436             406          175          1796
African-American              14            162              15            5           196
Hispanic                      19             17             128            2           166
Asian/Pacific-Islander        68             75              71          203           417
Other                         30             55              56            4           145
Total                        910            745             676          389          2720

Source: Thomas A. Laveist and Amani Nuru-Jeter, “Is Doctor–Patient Race Concordance Associated with Greater

Satisfaction with Care?” Journal of Health and Social Behavior, 43 (2002), 296–306.

(a) What is the probability that a randomly selected subject will have an Asian/Pacific-Islander

physician?

(b) What is the probability that an African-American subject will have an African-American

physician?

(c) What is the probability that a randomly selected subject in the study will be Asian-American and

have an Asian/Pacific-Islander physician?

(d) What is the probability that a subject chosen at random will be Hispanic or have a Hispanic

physician?

(e) Use the concept of complementary events to find the probability that a subject chosen at random

in the study does not have a white physician.

3.4.5 If the probability of left-handedness in a certain group of people is .05, what is the probability of

right-handedness (assuming no ambidexterity)?

3.4.6 The probability is .6 that a patient selected at random from the current residents of a certain hospital

will be a male. The probability that the patient will be a male who is in for surgery is .2. A patient

randomly selected from current residents is found to be a male; what is the probability that the patient

is in the hospital for surgery?

3.4.7 In a certain population of hospital patients the probability is .35 that a randomly selected patient will

have heart disease. The probability is .86 that a patient with heart disease is a smoker. What is the
probability that a patient randomly selected from the population will be a smoker and have heart disease?

3.5 BAYES’ THEOREM, SCREENING TESTS,
SENSITIVITY, SPECIFICITY, AND PREDICTIVE
VALUE POSITIVE AND NEGATIVE

In the health sciences field a widely used application of probability laws and concepts is

found in the evaluation of screening tests and diagnostic criteria. Of interest to clinicians is

an enhanced ability to correctly predict the presence or absence of a particular disease from


knowledge of test results (positive or negative) and/or the status of presenting symptoms

(present or absent). Also of interest is information regarding the likelihood of positive and

negative test results and the likelihood of the presence or absence of a particular symptom

in patients with and without a particular disease.

In our consideration of screening tests, we must be aware of the fact that they are not
infallible. That is, a testing procedure may yield a false positive or a false negative.

DEFINITION

1. A false positive results when a test indicates a positive status when

the true status is negative.

2. A false negative results when a test indicates a negative status when

the true status is positive.

In summary, the following questions must be answered in order to evaluate the

usefulness of test results and symptom status in determining whether or not a subject has

some disease:

1. Given that a subject has the disease, what is the probability of a positive test result (or

the presence of a symptom)?

2. Given that a subject does not have the disease, what is the probability of a negative

test result (or the absence of a symptom)?

3. Given a positive screening test (or the presence of a symptom), what is the probability

that the subject has the disease?

4. Given a negative screening test result (or the absence of a symptom), what is the

probability that the subject does not have the disease?

Suppose we have for a sample of n subjects (where n is a large number) the

information shown in Table 3.5.1. The table shows for these n subjects their status with

regard to a disease and results from a screening test designed to identify subjects with the

disease. The cell entries represent the number of subjects falling into the categories defined

by the row and column headings. For example, a is the number of subjects who have the

disease and whose screening test result was positive.

TABLE 3.5.1 Sample of n Subjects (Where n Is Large) Cross-Classified According to
Disease Status and Screening Test Result

                              Disease
Test Result              Present (D)   Absent ($\bar{D}$)   Total
Positive (T)                  a               b             a + b
Negative ($\bar{T}$)          c               d             c + d
Total                       a + c           b + d             n

As we have learned, a variety of probability estimates may be computed from the
information displayed in a two-way table such as Table 3.5.1. For example, we may
compute the conditional probability estimate $P(T \mid D) = a/(a + c)$. This ratio is an
estimate of the sensitivity of the screening test.

DEFINITION

The sensitivity of a test (or symptom) is the probability of a positive test

result (or presence of the symptom) given the presence of the disease.

We may also compute the conditional probability estimate $P(\bar{T} \mid \bar{D}) = d/(b + d)$.
This ratio is an estimate of the specificity of the screening test.

DEFINITION

The specificity of a test (or symptom) is the probability of a negative test

result (or absence of the symptom) given the absence of the disease.

From the data in Table 3.5.1 we answer Question 3 by computing the conditional
probability estimate $P(D \mid T) = a/(a + b)$. This ratio is an estimate of a probability called the predictive
value positive of a screening test (or symptom).

DEFINITION

The predictive value positive of a screening test (or symptom) is the

probability that a subject has the disease given that the subject has a

positive screening test result (or has the symptom).

Similarly, the ratio $P(\bar{D} \mid \bar{T}) = d/(c + d)$ is an estimate of the conditional probability that a subject
does not have the disease given that the subject has a negative screening test result (or does
not have the symptom). The probability estimated by this ratio is called the predictive value
negative of the screening test or symptom.

DEFINITION

The predictive value negative of a screening test (or symptom) is the

probability that a subject does not have the disease, given that the subject

has a negative screening test result (or does not have the symptom).
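The four ratios defined above can be sketched as a single function of the cell counts a, b, c, and d of Table 3.5.1. The function name and the counts used below are hypothetical and serve only to illustrate the arithmetic. The sketch assumes a single sample of n subjects cross-classified by disease status and test result, which is the setting of Table 3.5.1.

```python
from fractions import Fraction

def screening_estimates(a, b, c, d):
    """Estimate the four screening-test measures from the cells of Table 3.5.1.

    a: disease present, test positive    b: disease absent, test positive
    c: disease present, test negative    d: disease absent, test negative
    """
    return {
        "sensitivity": Fraction(a, a + c),                # P(T | D)
        "specificity": Fraction(d, b + d),                # P(not T | not D)
        "predictive value positive": Fraction(a, a + b),  # P(D | T)
        "predictive value negative": Fraction(d, c + d),  # P(not D | not T)
    }

# Hypothetical counts, chosen only for illustration.
estimates = screening_estimates(a=90, b=30, c=10, d=870)
for name, value in estimates.items():
    print(name, round(float(value), 4))
```

The two predictive-value ratios computed this way are meaningful only when the sample reflects the rate of the disease in the relevant population.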

Estimates of the predictive value positive and predictive value negative of a test (or

symptom) may be obtained from knowledge of a test’s (or symptom’s) sensitivity and

specificity and the probability of the relevant disease in the general population. To obtain

these predictive value estimates, we make use of Bayes’s theorem. The following statement

of Bayes’s theorem, employing the notation established in Table 3.5.1, gives the predictive

value positive of a screening test (or symptom):

$$P(D \mid T) = \frac{P(T \mid D)P(D)}{P(T \mid D)P(D) + P(T \mid \bar{D})P(\bar{D})} \tag{3.5.1}$$


It is instructive to examine the composition of Equation 3.5.1. We recall from

Equation 3.4.2 that the conditional probability $P(D \mid T)$ is equal to $P(D \cap T)/P(T)$. To
understand the logic of Bayes’s theorem, we must recognize that the numerator of Equation
3.5.1 represents $P(D \cap T)$ and that the denominator represents $P(T)$. We know from the
multiplication rule of probability given in Equation 3.4.1 that the numerator of Equation
3.5.1, $P(T \mid D)P(D)$, is equal to $P(D \cap T)$.

Now let us show that the denominator of Equation 3.5.1 is equal to $P(T)$. We know
that event T is the result of a subject’s being classified as positive with respect to a
screening test (or classified as having the symptom). A subject classified as positive may
have the disease or may not have the disease. Therefore, the occurrence of T is the result
of a subject having the disease and being positive $[P(D \cap T)]$ or not having the disease
and being positive $[P(\bar{D} \cap T)]$. These two events are mutually exclusive (their
intersection is empty), and consequently, by the addition rule given by Equation 3.4.3, we
may write

$$P(T) = P(D \cap T) + P(\bar{D} \cap T) \tag{3.5.2}$$

Since, by the multiplication rule, $P(D \cap T) = P(T \mid D)P(D)$ and
$P(\bar{D} \cap T) = P(T \mid \bar{D})P(\bar{D})$, we may rewrite Equation 3.5.2 as

$$P(T) = P(T \mid D)P(D) + P(T \mid \bar{D})P(\bar{D}) \tag{3.5.3}$$

which is the denominator of Equation 3.5.1.

Note, also, that the numerator of Equation 3.5.1 is equal to the sensitivity times the
rate (prevalence) of the disease and the denominator is equal to the sensitivity times the rate
of the disease plus the term 1 minus the specificity times the term 1 minus the rate of the
disease. Thus, we see that the predictive value positive can be calculated from knowledge
of the sensitivity, specificity, and the rate of the disease.

Evaluation of Equation 3.5.1 answers Question 3. To answer Question 4 we

follow a now familiar line of reasoning to arrive at the following statement of Bayes’s

theorem:

$$P(\bar{D} \mid \bar{T}) = \frac{P(\bar{T} \mid \bar{D})P(\bar{D})}{P(\bar{T} \mid \bar{D})P(\bar{D}) + P(\bar{T} \mid D)P(D)} \tag{3.5.4}$$

Equation 3.5.4 allows us to compute an estimate of the probability that a subject who is

negative on the test (or has no symptom) does not have the disease, which is the predictive

value negative of a screening test or symptom.

We illustrate the use of Bayes’ theorem for calculating a predictive value positive

with the following example.

EXAMPLE 3.5.1

A medical research team wished to evaluate a proposed screening test for Alzheimer’s

disease. The test was given to a random sample of 450 patients with Alzheimer’s disease

and an independent random sample of 500 patients without symptoms of the disease.


The two samples were drawn from populations of subjects who were 65 years of age or

older. The results are as follows:

                          Alzheimer’s Diagnosis?
Test Result              Yes (D)   No ($\bar{D}$)   Total
Positive (T)               436          5            441
Negative ($\bar{T}$)        14        495            509
Total                      450        500            950

Using these data we estimate the sensitivity of the test to be $P(T \mid D) = 436/450 = .97$. The
specificity of the test is estimated to be $P(\bar{T} \mid \bar{D}) = 495/500 = .99$. We now use the results of
the study to compute the predictive value positive of the test. That is, we wish to estimate the
probability that a subject who is positive on the test has Alzheimer’s disease. From the
tabulated data we compute $P(T \mid D) = 436/450 = .9689$ and $P(T \mid \bar{D}) = 5/500 = .01$.
Substitution of these results into Equation 3.5.1 gives

$$P(D \mid T) = \frac{(.9689)P(D)}{(.9689)P(D) + (.01)P(\bar{D})} \tag{3.5.5}$$

We see that the predictive value positive of the test depends on the rate of the disease in the
relevant population in general. In this case the relevant population consists of subjects who
are 65 years of age or older. We emphasize that the rate of disease in the relevant general
population, $P(D)$, cannot be computed from the sample data, since two independent samples
were drawn from two different populations. We must look elsewhere for an estimate of $P(D)$.
Evans et al. (A-5) estimated that 11.3 percent of the U.S. population aged 65 and over have
Alzheimer’s disease. When we substitute this estimate of $P(D)$ into Equation 3.5.5 we
obtain

$$P(D \mid T) = \frac{(.9689)(.113)}{(.9689)(.113) + (.01)(1 - .113)} = .93$$

As we see, in this case, the predictive value positive of the test is very high.

Similarly, let us now consider the predictive value negative of the test. We have
already calculated all entries necessary except for $P(\bar{T} \mid D) = 14/450 = .0311$. Using the
values previously obtained and our new value, we find

$$P(\bar{D} \mid \bar{T}) = \frac{(.99)(1 - .113)}{(.99)(1 - .113) + (.0311)(.113)} = .996$$

As we see, the predictive value negative is also quite high.
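The calculations of Example 3.5.1 can be sketched as a small function of the sensitivity, the specificity, and the rate of the disease. The function and variable names are our own. The sketch follows Equations 3.5.1 and 3.5.4 and uses $P(T \mid \bar{D}) = 1 - \text{specificity}$ and $P(\bar{T} \mid D) = 1 - \text{sensitivity}$.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (predictive value positive, predictive value negative)."""
    # Denominator of Equation 3.5.1: P(T), with P(T | not D) = 1 - specificity.
    p_pos = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
    pvp = sensitivity * prevalence / p_pos            # Equation 3.5.1
    # Denominator of Equation 3.5.4: P(not T), with P(not T | D) = 1 - sensitivity.
    p_neg = specificity * (1 - prevalence) + (1 - sensitivity) * prevalence
    pvn = specificity * (1 - prevalence) / p_neg      # Equation 3.5.4
    return pvp, pvn

sens = 436 / 450   # estimated sensitivity from the Alzheimer's sample
spec = 495 / 500   # estimated specificity from the non-Alzheimer's sample
prev = 0.113       # rate of the disease cited in the example

pvp, pvn = predictive_values(sens, spec, prev)
print(round(pvp, 2), round(pvn, 3))  # 0.93 0.996
```

Because the rate of the disease enters as a separate argument, the same function can be re-evaluated for other populations without changing the sensitivity and specificity estimates.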


EXERCISES

3.5.1 A medical research team wishes to assess the usefulness of a certain symptom (call it S) in the

diagnosis of a particular disease. In a random sample of 775 patients with the disease, 744 reported

having the symptom. In an independent random sample of 1380 subjects without the disease, 21

reported that they had the symptom.

(a) In the context of this exercise, what is a false positive?

(b) What is a false negative?

(c) Compute the sensitivity of the symptom.

(d) Compute the specificity of the symptom.

(e) Suppose it is known that the rate of the disease in the general population is .001. What is the

predictive value positive of the symptom?

(f) What is the predictive value negative of the symptom?

(g) Find the predictive value positive and the predictive value negative for the symptom for the

following hypothetical disease rates: .0001, .01, and .10.

(h) What do you conclude about the predictive value of the symptom on the basis of the results

obtained in part g?

3.5.2 In an article entitled “Bucket-Handle Meniscal Tears of the Knee: Sensitivity and Specificity of MRI

signs,” Dorsay and Helms (A-6) performed a retrospective study of 71 knees scanned by MRI. One of

the indicators they examined was the absence of the “bow-tie sign” in the MRI as evidence of a

bucket-handle or “bucket-handle type” tear of the meniscus. In the study, surgery confirmed that 43 of

the 71 cases were bucket-handle tears. The cases may be cross-classified by “bow-tie sign” status and

surgical results as follows:

                                                  Tear Surgically   Tear Surgically Confirmed
                                                  Confirmed (D)     As Not Present ($\bar{D}$)   Total
Positive Test (absent bow-tie sign) (T)                 38                    10                  48
Negative Test (bow-tie sign present) ($\bar{T}$)         5                    18                  23
Total                                                   43                    28                  71

Source: Theodore A. Dorsay and Clyde A. Helms, “Bucket-handle Meniscal Tears of the Knee: Sensitivity

and Specificity of MRI Signs,” Skeletal Radiology, 32 (2003), 266–272.

(a) What is the sensitivity of testing to see if the absent bow tie sign indicates a meniscal tear?

(b) What is the specificity of testing to see if the absent bow tie sign indicates a meniscal tear?

(c) What additional information would you need to determine the predictive value of the test?

3.5.3 Oexle et al. (A-7) calculated the negative predictive value of a test for carriers of X-linked ornithine

transcarbamylase deficiency (OTCD—a disorder of the urea cycle). A test known as the “allopurinol

test” is often used as a screening device of potential carriers whose relatives are OTCD patients. They

cited a study by Brusilow and Horwich (A-8) that estimated the sensitivity of the allopurinol test as

.927. Oexle et al. themselves estimated the specificity of the allopurinol test as .997. Also they

estimated the prevalence in the population of individuals with OTCD as 1/32000. Use this

information and Bayes’s theorem to calculate the predictive value negative of the allopurinol

screening test.


3.6 SUMMARY

In this chapter some of the basic ideas and concepts of probability were presented. The

objective has been to provide enough of a “feel” for the subject so that the probabilistic

aspects of statistical inference can be more readily understood and appreciated when this

topic is presented later.

We defined probability as a number between 0 and 1 that measures the likelihood of

the occurrence of some event. We distinguished between subjective probability and

objective probability. Objective probability can be categorized further as classical or

relative frequency probability. After stating the three properties of probability, we defined

and illustrated the calculation of the following kinds of probabilities: marginal, joint, and

conditional. We also learned how to apply the addition and multiplication rules to find

certain probabilities. We learned the meaning of independent, mutually exclusive, and

complementary events. We learned the meaning of specificity, sensitivity, predictive value

positive, and predictive value negative as applied to a screening test or disease symptom.

Finally, we learned how to use Bayes’s theorem to calculate the probability that a subject

has a disease, given that the subject has a positive screening test result (or has the symptom

of interest).

SUMMARY OF FORMULAS FOR CHAPTER 3

Formula number   Name   Formula

3.2.1   Classical probability   $P(E) = \dfrac{m}{N}$

3.2.2   Relative frequency probability   $P(E) = \dfrac{m}{n}$

3.3.1–3.3.3   Properties of probability   $P(E_i) \ge 0$;   $P(E_1) + P(E_2) + \cdots + P(E_n) = 1$;   $P(E_i + E_j) = P(E_i) + P(E_j)$

3.4.1   Multiplication rule   $P(A \cap B) = P(B)P(A \mid B) = P(A)P(B \mid A)$

3.4.2   Conditional probability   $P(A \mid B) = \dfrac{P(A \cap B)}{P(B)}$

3.4.3   Addition rule   $P(A \cup B) = P(A) + P(B) - P(A \cap B)$

3.4.4   Independent events   $P(A \cap B) = P(A)P(B)$

3.4.5   Complementary events   $P(\bar{A}) = 1 - P(A)$

3.4.6   Marginal probability   $P(A_i) = \sum P(A_i \cap B_j)$

Sensitivity of a screening test   $P(T \mid D) = \dfrac{a}{a + c}$

Specificity of a screening test   $P(\bar{T} \mid \bar{D}) = \dfrac{d}{b + d}$


3.5.1   Predictive value positive of a screening test   $P(D \mid T) = \dfrac{P(T \mid D)P(D)}{P(T \mid D)P(D) + P(T \mid \bar{D})P(\bar{D})}$

3.5.2   Predictive value negative of a screening test   $P(\bar{D} \mid \bar{T}) = \dfrac{P(\bar{T} \mid \bar{D})P(\bar{D})}{P(\bar{T} \mid \bar{D})P(\bar{D}) + P(\bar{T} \mid D)P(D)}$

Symbol Key

• D = disease

• E = event

• m = the number of times an event $E_i$ occurs

• n = sample size or the total number of times a process occurs

• N = population size or the total number of mutually exclusive and equally likely events

• $P(\bar{A})$ = a complementary event; the probability of an event A not occurring

• $P(E_i)$ = probability of some event $E_i$ occurring

• $P(A \cap B)$ = an “intersection” or “and” statement; the probability of an event A and an event B occurring

• $P(A \cup B)$ = a “union” or “or” statement; the probability of an event A or an event B or both occurring

• $P(A \mid B)$ = a conditional statement; the probability of an event A occurring given that an event B has already occurred

• T = test results
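The predictive value formulas 3.5.1 and 3.5.2 can be sketched in a few lines of Python. The function name `predictive_values` and the input values (sensitivity .90, specificity .95, prevalence .01) are hypothetical, chosen only for illustration; they do not come from any study cited in this chapter.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (predictive value positive, predictive value negative).

    Applies Bayes's theorem as in formulas 3.5.1 and 3.5.2:
      sensitivity = P(T | D), specificity = P(not T | not D),
      prevalence  = P(D).
    """
    p_d = prevalence
    p_not_d = 1.0 - prevalence
    p_t_given_not_d = 1.0 - specificity   # false positive rate
    p_not_t_given_d = 1.0 - sensitivity   # false negative rate

    pv_pos = (sensitivity * p_d) / (sensitivity * p_d + p_t_given_not_d * p_not_d)
    pv_neg = (specificity * p_not_d) / (specificity * p_not_d + p_not_t_given_d * p_d)
    return pv_pos, pv_neg


# Hypothetical inputs, for illustration only.
pv_pos, pv_neg = predictive_values(sensitivity=0.90, specificity=0.95, prevalence=0.01)
print(f"Predictive value positive: {pv_pos:.4f}")
print(f"Predictive value negative: {pv_neg:.4f}")
```

With a prevalence this low, the predictive value positive stays small even though the sensitivity and specificity are both high, which is the behavior the formulas are meant to reveal.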

REVIEW QUESTIONS AND EXERCISES

1. Define the following:

(a) Probability (b) Objective probability

(c) Subjective probability (d) Classical probability

(e) The relative frequency concept of probability (f) Mutually exclusive events

(g) Independence (h) Marginal probability

(i) Joint probability (j) Conditional probability

(k) The addition rule (l) The multiplication rule

(m) Complementary events (n) False positive

(o) False negative (p) Sensitivity

(q) Specificity (r) Predictive value positive

(s) Predictive value negative (t) Bayes’s theorem

2. Name and explain the three properties of probability.

3. Coughlin et al. (A-9) examined the breast and cervical screening practices of Hispanic and non-

Hispanic women in counties that approximate the U.S. southern border region. The study used data

from the Behavioral Risk Factor Surveillance System surveys of adults age 18 years or older

conducted in 1999 and 2000. The table below reports the number of observations of Hispanic and

non-Hispanic women who had received a mammogram in the past 2 years cross-classified with

marital status.


Marital Status Hispanic Non-Hispanic Total

Currently Married 319 738 1057

Divorced or Separated 130 329 459

Widowed 88 402 490

Never Married or Living As

an Unmarried Couple

41 95 136

Total 578 1564 2142

Source: Steven S. Coughlin, Robert J. Uhler, Thomas Richards, and Katherine

M. Wilson, “Breast and Cervical Cancer Screening Practices Among Hispanic

and Non-Hispanic Women Residing Near the United States–Mexico Border,

1999–2000,” Family and Community Health, 26 (2003), 130–139.

(a) We select at random a subject who had a mammogram. What is the probability that she is

divorced or separated?

(b) We select at random a subject who had a mammogram and learn that she is Hispanic. With that

information, what is the probability that she is married?

(c) We select at random a subject who had a mammogram. What is the probability that she is non-

Hispanic and divorced or separated?

(d) We select at random a subject who had a mammogram. What is the probability that she is

Hispanic or she is widowed?

(e) We select at random a subject who had a mammogram. What is the probability that she is not

married?

4. Swor et al. (A-10) looked at the effectiveness of cardiopulmonary resuscitation (CPR) training in

people over 55 years old. They compared the skill retention rates of subjects in this age group who

completed a course in traditional CPR instruction with those who received chest-compression only

cardiopulmonary resuscitation (CC-CPR). Independent groups were tested 3 months after training.

The table below shows the skill retention numbers in regard to overall competence as assessed by

video ratings done by two video evaluators.

Rated Overall

Competent CPR CC-CPR Total

Yes 12 15 27

No 15 14 29

Total 27 29 56

Source: Robert Swor, Scott Compton, Fern Vining, Lynn Ososky

Farr, Sue Kokko, Rebecca Pascual, and Raymond E. Jackson,

“A Randomized Controlled Trial of Chest Compression Only

CPR for Older Adults—a Pilot Study,” Resuscitation, 58 (2003),

177–185.

(a) Find the following probabilities and explain their meaning:

1. A randomly selected subject was enrolled in the CC-CPR class.

2. A randomly selected subject was rated competent.

3. A randomly selected subject was rated competent and was enrolled in the CPR course.

4. A randomly selected subject was rated competent or was enrolled in CC-CPR.

5. A randomly selected subject was rated competent given that they enrolled in the CC-CPR

course.


(b) We define the following events to be

A = a subject enrolled in the CPR course

B = a subject enrolled in the CC-CPR course

C = a subject was evaluated as competent

D = a subject was evaluated as not competent

Then explain why each of the following equations is or is not a true statement:

1. $P(A \cap C) = P(C \cap A)$          2. $P(A \cup B) = P(B \cup A)$

3. $P(A) = P(A \cup C) + P(A \cup D)$   4. $P(B \cup C) = P(B) + P(C)$

5. $P(D \mid A) = P(D)$                 6. $P(C \cap B) = P(C)P(B)$

7. $P(A \cap B) = 0$                    8. $P(C \cap B) = P(B)P(C \mid B)$

9. $P(A \cap D) = P(A)P(A \mid D)$

5. Pillmann et al. (A-11) studied patients with acute brief episodes of psychoses. The researchers

classified subjects into four personality types (obsessoid, asthenic/low self-confident, asthenic/high

self-confident, and nervous/tense) or as undeterminable. The table below cross-classifies these personality

types with three groups of subjects—those with acute and transient psychotic disorders (ATPD),

those with “positive” schizophrenia (PS), and those with bipolar schizo-affective disorder (BSAD):

Personality Type ATPD (1) PS (2) BSAD (3) Total

Obsessoid (O) 9 2 6 17

Asthenic/low Self-confident (A) 20 17 15 52

Asthenic/high Self-confident (S) 5 3 8 16

Nervous/tense (N) 4 7 4 15

Undeterminable (U) 4 13 9 26

Total 42 42 42 126

Source: Frank Pillmann, Raffaela Blöink, Sabine Balzuweit, Annette Haring, and

Andreas Marneros, “Personality and Social Interactions in Patients with Acute Brief

Psychoses,” Journal of Nervous and Mental Disease, 191 (2003), 503–508.

Find the following probabilities if a subject in this study is chosen at random:

(a) $P(O)$   (b) $P(A \cup 2)$   (c) $P(1)$   (d) $P(\bar{A})$

(e) $P(A \mid 3)$   (f) $P(\bar{3})$   (g) $P(2 \cap 3)$   (h) $P(2 \mid A)$

6. A certain county health department has received 25 applications for an opening that exists for a public

health nurse. Of these applicants 10 are over 30 and 15 are under 30. Seventeen hold bachelor’s

degrees only, and eight have master’s degrees. Of those under 30, six have master’s degrees. If a

selection from among these 25 applicants is made at random, what is the probability that a person

over 30 or a person with a master’s degree will be selected?

7. The following table shows 1000 nursing school applicants classified according to scores made on a

college entrance examination and the quality of the high school from which they graduated, as rated

by a group of educators:

Quality of High Schools

Score Poor (P) Average (A) Superior (S) Total

Low (L) 105 60 55 220

Medium (M) 70 175 145 390

High (H) 25 65 300 390

Total 200 300 500 1000


(a) Calculate the probability that an applicant picked at random from this group:

1. Made a low score on the examination.

2. Graduated from a superior high school.

3. Made a low score on the examination and graduated from a superior high school.

4. Made a low score on the examination given that he or she graduated from a superior high

school.

5. Made a high score or graduated from a superior high school.

(b) Calculate the following probabilities:

1. P(A) 2. P(H) 3. P(M)

4. $P(A \mid H)$ 5. $P(M \cap P)$ 6. $P(H \mid S)$

8. If the probability that a public health nurse will find a client at home is .7, what is the probability

(assuming independence) that on two home visits made in a day both clients will be home?

9. For a variety of reasons, self-reported disease outcomes are frequently used without verification in

epidemiologic research. In a study by Parikh-Patel et al. (A-12), researchers looked at the relationship

between self-reported cancer cases and actual cases. They used the self-reported cancer data from a

California Teachers Study and validated the cancer cases by using the California Cancer Registry

data. The following table reports their findings for breast cancer:

Cancer Reported (A) Cancer in Registry (B) Cancer Not in Registry Total

Yes 2991 2244 5235

No 112 115849 115961

Total 3103 118093 121196

Source: Arti Parikh-Patel, Mark Allen, William E. Wright, and the California Teachers Study Steering Committee,

“Validation of Self-reported Cancers in the California Teachers Study,” American Journal of Epidemiology,

157 (2003), 539–545.

(a) Let A be the event of reporting breast cancer in the California Teachers Study. Find the

probability of A in this study.

(b) Let B be the event of having breast cancer confirmed in the California Cancer Registry. Find the

probability of B in this study.

(c) Find $P(A \cap B)$

(d) Find $P(A \mid B)$

(e) Find $P(B \mid A)$

(f) Find the sensitivity of using self-reported breast cancer as a predictor of actual breast cancer in

the California registry.

(g) Find the specificity of using self-reported breast cancer as a predictor of actual breast cancer in

the California registry.

10. In a certain population the probability that a randomly selected subject will have been exposed to

a certain allergen and experience a reaction to the allergen is .60. The probability is .8 that a

subject exposed to the allergen will experience an allergic reaction. If a subject is selected at

random from this population, what is the probability that he or she will have been exposed to the

allergen?

11. Suppose that 3 percent of the people in a population of adults have attempted suicide. It is also known

that 20 percent of the population are living below the poverty level. If these two events are


independent, what is the probability that a person selected at random from the population will have

attempted suicide and be living below the poverty level?

12. In a certain population of women 4 percent have had breast cancer, 20 percent are smokers, and 3

percent are smokers and have had breast cancer. A woman is selected at random from the population.

What is the probability that she has had breast cancer or smokes or both?

13. The probability that a person selected at random from a population will exhibit the classic symptom

of a certain disease is .2, and the probability that a person selected at random has the disease is .23.

The probability that a person who has the symptom also has the disease is .18. A person selected at

random from the population does not have the symptom. What is the probability that the person has

the disease?

14. For a certain population we define the following events for mother’s age at time of giving birth: A =

under 20 years; B = 20–24 years; C = 25–29 years; D = 30–44 years. Are the events A, B, C, and D

pairwise mutually exclusive?

15. Refer to Exercise 14. State in words the event $E = (A \cup B)$.

16. Refer to Exercise 14. State in words the event $F = (B \cup C)$.

17. Refer to Exercise 14. Comment on the event $G = (A \cap B)$.

18. For a certain population we define the following events with respect to plasma lipoprotein levels

(mg/dl): A = (10–15); B = ($\ge$ 30); C = ($\le$ 20). Are the events A and B mutually exclusive? A and

C? B and C? Explain your answer to each question.

19. Refer to Exercise 18. State in words the meaning of the following events:

(a) $A \cup B$ (b) $A \cap B$ (c) $A \cap C$ (d) $A \cup C$

20. Refer to Exercise 18. State in words the meaning of the following events:

(a) $\bar{A}$ (b) $\bar{B}$ (c) $\bar{C}$

21. Rothenberg et al. (A-13) investigated the effectiveness of using the Hologic Sahara Sonometer, a

portable device that measures bone mineral density (BMD) in the ankle, in predicting a fracture. They

used a Hologic estimated bone mineral density value of .57 as a cutoff. The results of the

investigation yielded the following data:

                               Confirmed Fracture
                               Present (D)   Not Present ($\bar{D}$)   Total
BMD $\le$ .57 (T)                 214            670                    884
BMD $>$ .57 ($\bar{T}$)            73            330                    403
Total                             287           1000                   1287

Source: Data provided courtesy of Ralph J. Rothenberg, M.D., Joan

L. Boyd, Ph.D., and John P. Holcomb, Ph.D.

(a) Calculate the sensitivity of using a BMD value of .57 as a cutoff value for predicting fracture and

interpret your results.

(b) Calculate the specificity of using a BMD value of .57 as a cutoff value for predicting fracture and

interpret your results.


22. Verma et al. (A-14) examined the use of heparin-PF4 ELISA screening for heparin-induced

thrombocytopenia (HIT) in critically ill patients. Using C-serotonin release assay (SRA) as the

way of validating HIT, the authors found that in 31 patients tested negative by SRA, 22 also tested

negative by heparin-PF4 ELISA.

(a) Calculate the specificity of the heparin-PF4 ELISA testing for HIT.

(b) Using a “literature derived sensitivity” of 95 percent and a prior probability of HIT occurrence as

3.1 percent, find the positive predictive value.

(c) Using the same information as part (b), find the negative predictive value.

23. The sensitivity of a screening test is .95, and its specificity is .85. The rate of the disease for which the

test is used is .002. What is the predictive value positive of the test?

Exercises for Use with Large Data Sets Available on the Following Website:

www.wiley.com/college/daniel

Refer to the random sample of 800 subjects from the North Carolina birth registry we investigated in

the Chapter 2 review exercises.

1. Create a table that cross-tabulates the counts of mothers in the classifications of whether the baby

was premature or not (PREMIE) and whether the mother admitted to smoking during pregnancy

(SMOKE) or not.

(a) Find the probability that a mother in this sample admitted to smoking.

(b) Find the probability that a mother in this sample had a premature baby.

(c) Find the probability that a mother in the sample had a premature baby given that the mother

admitted to smoking.

(d) Find the probability that a mother in the sample had a premature baby given that the mother

did not admit to smoking.

(e) Find the probability that a mother in the sample had a premature baby or that the mother did

not admit to smoking.

2. Create a table that cross-tabulates the counts of each mother’s marital status (MARITAL) and

whether she had a low birth weight baby (LOW).

(a) Find the probability a mother selected at random in this sample had a low birth weight baby.

(b) Find the probability a mother selected at random in this sample was married.

(c) Find the probability a mother selected at random in this sample had a low birth weight child

given that she was married.

(d) Find the probability a mother selected at random in this sample had a low birth weight child

given that she was not married.

(e) Find the probability a mother selected at random in this sample had a low birth weight child

and the mother was married.

REFERENCES

Methodology References

1. ALLAN GUT, An Intermediate Course in Probability, Springer-Verlag, New York, 1995.

2. RICHARD ISAAC, The Pleasures of Probability, Springer-Verlag, New York, 1995.

3. HAROLD J. LARSON, Introduction to Probability, Addison-Wesley, Reading, MA, 1995.

4. L. J. SAVAGE, Foundations of Statistics, Second Revised Edition, Dover, New York, 1972.

5. A. N. KOLMOGOROV, Foundations of the Theory of Probability, Chelsea, New York, 1964 (Original German edition

published in 1933).


Applications References

A-1. TASHA D. CARTER, EMANUELA MUNDO, SAGAR V. PARIKH, and JAMES L. KENNEDY, “Early Age at Onset as a Risk Factor

for Poor Outcome of Bipolar Disorder,” Journal of Psychiatric Research, 37 (2003), 297–303.

A-2. JOHN H. PORCERELLI, ROSEMARY COGAN, PATRICIA P. WEST, EDWARD A. ROSE, DAWN LAMBRECHT, KAREN E. WILSON,

RICHARD K. SEVERSON, and DUNIA KARANA, “Violent Victimization of Women and Men: Physical and Psychiatric

Symptoms,” Journal of the American Board of Family Practice, 16 (2003), 32–39.

A-3. DANIEL FERNANDO, ROBERT F. SCHILLING, JORGE FONTDEVILA, and NABILA EL-BASSEL, “Predictors of Sharing Drugs

among Injection Drug Users in the South Bronx: Implications for HIV Transmission,” Journal of Psychoactive

Drugs, 35 (2003), 227–236.

A-4. THOMAS A. LAVEIST and AMANI NURU-JETER, “Is Doctor-patient Race Concordance Associated with Greater

Satisfaction with Care?” Journal of Health and Social Behavior, 43 (2002), 296–306.

A-5. D. A. EVANS, P. A. SCHERR, N. R. COOK, M. S. ALBERT, H. H. FUNKENSTEIN, L. A. SMITH, L. E. HEBERT, T. T. WETLE,

L. G. BRANCH, M. CHOWN, C. H. HENNEKENS, and J. O. TAYLOR, “Estimated Prevalence of Alzheimer’s Disease in

the United States,” Milbank Quarterly, 68 (1990), 267–289.

A-6. THEODORE A. DORSAY and CLYDE A. HELMS, “Bucket-handle Meniscal Tears of the Knee: Sensitivity and

Specificity of MRI Signs,” Skeletal Radiology, 32 (2003), 266–272.

A-7. KONRAD OEXLE, LUISA BONAFE, and BEAT STENMANN, “Remark on Utility and Error Rates of the Allopurinol Test

in Detecting Mild Ornithine Transcarbamylase Deficiency,” Molecular Genetics and Metabolism, 76 (2002),

71–75.

A-8. S. W. BRUSILOW, A.L. HORWICH, “Urea Cycle Enzymes,” in: C. R. SCRIVER, A. L. BEAUDET, W. S. SLY, D. VALLE

(Eds.), The Metabolic and Molecular Bases of Inherited Disease, 8th ed., McGraw-Hill, New York, 2001,

pp. 1909–1963.

A-9. STEVEN S. COUGHLIN, ROBERT J. UHLER, THOMAS RICHARDS, and KATHERINE M. WILSON, “Breast and Cervical Cancer

Screening Practices Among Hispanic and Non-Hispanic Women Residing Near the United States-Mexico

Border, 1999–2000,” Family and Community Health, 26 (2003), 130–139.

A-10. ROBERT SWOR, SCOTT COMPTON, FERN VINING, LYNN OSOSKY FARR, SUE KOKKO, REBECCA PASCUAL, and RAYMOND E.

JACKSON, “A Randomized Controlled Trial of Chest Compression Only CPR for Older Adults—a Pilot Study,”

Resuscitation, 58 (2003), 177–185.

A-11. FRANK PILLMANN, RAFFAELA BLÖINK, SABINE BALZUWEIT, ANNETTE HARING, and ANDREAS MARNEROS, “Personality

and Social Interactions in Patients with Acute Brief Psychoses,” The Journal of Nervous and Mental Disease, 191

(2003), 503–508.

A-12. ARTI PARIKH-PATEL, MARK ALLEN, WILLIAM E. WRIGHT, and the California Teachers Study Steering Committee,

“Validation of Self-reported Cancers in the California Teachers Study,” American Journal of Epidemiology, 157

(2003), 539–545.

A-13. RALPH J. ROTHENBERG, JOAN L. BOYD, and JOHN P. HOLCOMB, “Quantitative Ultrasound of the Calcaneus as a

Screening Tool to Detect Osteoporosis: Different Reference Ranges for Caucasian Women, African-American

Women, and Caucasian Men,” Journal of Clinical Densitometry, 7 (2004), 101–110.

A-14. ARUN K. VERMA, MARC LEVINE, STEPHEN J. CARTER, and JOHN G. KELTON, “Frequency of Heparin-Induced

Thrombocytopenia in Critical Care Patients,” Pharmacotherapy, 23 (2003), 645–753.


CHAPTER 4

PROBABILITY DISTRIBUTIONS

CHAPTER OVERVIEW

Probability distributions of random variables assume powerful roles in statistical

analyses. Since they show all possible values of a random variable and the

probabilities associated with these values, probability distributions may be

summarized in ways that enable researchers to easily make objective decisions

based on samples drawn from the populations that the distributions

represent. This chapter introduces frequently used discrete and continuous

probability distributions that are used in later chapters to make statistical

inferences.

TOPICS

4.1 INTRODUCTION

4.2 PROBABILITY DISTRIBUTIONS OF DISCRETE VARIABLES

4.3 THE BINOMIAL DISTRIBUTION

4.4 THE POISSON DISTRIBUTION

4.5 CONTINUOUS PROBABILITY DISTRIBUTIONS

4.6 THE NORMAL DISTRIBUTION

4.7 NORMAL DISTRIBUTION APPLICATIONS

4.8 SUMMARY

LEARNING OUTCOMES

After studying this chapter, the student will

1. understand selected discrete distributions and how to use them to calculate

probabilities in real-world problems.

2. understand selected continuous distributions and how to use them to calculate

probabilities in real-world problems.

3. be able to explain the similarities and differences between distributions of the

discrete type and the continuous type and when the use of each is appropriate.


4.1 INTRODUCTION

In the preceding chapter we introduced the basic concepts of probability as well as methods

for calculating the probability of an event. We build on these concepts in the present chapter

and explore ways of calculating the probability of an event under somewhat more complex

conditions. In this chapter we shall see that the relationship between the values of a random

variable and the probabilities of their occurrence may be summarized by means of a device

called a probability distribution. A probability distribution may be expressed in the form of

a table, graph, or formula. Knowledge of the probability distribution of a random variable

provides the clinician and researcher with a powerful tool for summarizing and describing

a set of data and for reaching conclusions about a population of data on the basis of a

sample of data drawn from the population.

4.2 PROBABILITY DISTRIBUTIONS

OF DISCRETE VARIABLES

Let us begin our discussion of probability distributions by considering the probability

distribution of a discrete variable, which we shall define as follows:

DEFINITION

The probability distribution of a discrete random variable is a table,

graph, formula, or other device used to specify all possible values of a

discrete random variable along with their respective probabilities.

If we let the discrete probability distribution be represented by $p(x)$, then

$p(x) = P(X = x)$ is the probability that the discrete random variable X assumes the value x.

EXAMPLE 4.2.1

In an article appearing in the Journal of the American Dietetic Association, Holben et al.

(A-1) looked at food security status in families in the Appalachian region of southern Ohio.

The purpose of the study was to examine hunger rates of families with children in a local

Head Start program in Athens, Ohio. The survey instrument included the 18-question U.S.

Household Food Security Survey Module for measuring hunger and food security. In

addition, participants were asked how many food assistance programs they had used in the

last 12 months. Table 4.2.1 shows the number of food assistance programs used by subjects

in this sample.

We wish to construct the probability distribution of the discrete variable X, where

X = number of food assistance programs used by the study subjects.

Solution: The values of X are $x_1 = 1, x_2 = 2, \ldots, x_7 = 7$, and $x_8 = 8$. We compute the

probabilities for these values by dividing their respective frequencies by

the total, 297. Thus, for example, $p(x_1) = P(X = x_1) = 62/297 = .2088$.


We display the results in Table 4.2.2, which is the desired probability

distribution. ■
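The calculation in this solution can be sketched in Python as follows. The frequencies are those of Table 4.2.1; the variable names `frequencies`, `total`, and `p` are our own choices for illustration.

```python
# Frequencies from Table 4.2.1: number of programs -> number of families.
frequencies = {1: 62, 2: 47, 3: 39, 4: 39, 5: 58, 6: 37, 7: 4, 8: 11}

# Total number of families in the sample.
total = sum(frequencies.values())

# p[x] = P(X = x) is the relative frequency of the value x.
p = {x: count / total for x, count in frequencies.items()}

print("x   P(X = x)")
for x in sorted(p):
    print(f"{x}   {p[x]:.4f}")
print(f"Total {sum(p.values()):.4f}")
```

Each probability is a frequency divided by the same total, so the probabilities lie between 0 and 1 and sum to 1 apart from floating-point rounding.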

Alternatively, we can present this probability distribution in the form of a graph, as in

Figure 4.2.1. In Figure 4.2.1 the length of each vertical bar indicates the probability for the

corresponding value of x.

It will be observed in Table 4.2.2 that the values of $p(x) = P(X = x)$ are all

positive, they are all less than 1, and their sum is equal to 1. These are not phenomena

peculiar to this particular example, but are characteristics of all probability distributions

of discrete variables. If $x_1, x_2, x_3, \ldots, x_k$ are all possible values of the discrete random

TABLE 4.2.1 Number of Assistance

Programs Utilized by Families with

Children in Head Start Programs in

Southern Ohio

Number of Programs Frequency

1 62

2 47

3 39

4 39

5 58

6 37

7 4

8 11

Total 297

Source: Data provided courtesy of David H. Holben,

Ph.D. and John P. Holcomb, Ph.D.

TABLE 4.2.2 Probability Distribution

of Programs Utilized by Families

Among the Subjects Described in

Example 4.2.1

Number of Programs (x) $P(X = x)$

1 .2088

2 .1582

3 .1313

4 .1313

5 .1953

6 .1246

7 .0135

8 .0370

Total 1.0000


variable X, then we may give the following two essential properties of a probability

distribution of a discrete variable:

(1) $0 \le P(X = x) \le 1$

(2) $\sum P(X = x) = 1$, for all x

The reader will also note that each of the probabilities in Table 4.2.2 is the relative

frequency of occurrence of the corresponding value of X.

With its probability distribution available to us, we can make probability statements

regarding the random variable X. We illustrate with some examples.

EXAMPLE 4.2.2

What is the probability that a randomly selected family used three assistance programs?

Solution: We may write the desired probability as $p(3) = P(X = 3)$. We see in

Table 4.2.2 that the answer is .1313. ■

EXAMPLE 4.2.3

What is the probability that a randomly selected family used either one or two programs?

Solution: To answer this question, we use the addition rule for mutually exclusive

events. Using probability notation and the results in Table 4.2.2, we write the

answer as $P(1 \cup 2) = P(1) + P(2) = .2088 + .1582 = .3670$. ■

[Figure: bar chart. Vertical axis: Probability (0.00 to 0.25). Horizontal axis: x (number of assistance programs), 1 to 8.]

FIGURE 4.2.1 Graphical representation of the probability

distribution shown in Table 4.2.2.


Cumulative Distributions Sometimes it will be more convenient to work with

the cumulative probability distribution of a random variable. The cumulative probability

distribution for the discrete variable whose probability distribution is given in Table 4.2.2

may be obtained by successively adding the probabilities, $P(X = x_i)$, given in the last

column. The cumulative probability for $x_i$ is written as $F(x_i) = P(X \le x_i)$. It gives the

probability that X is less than or equal to a specified value, $x_i$.

The resulting cumulative probability distribution is shown in Table 4.2.3. The graph

of the cumulative probability distribution is shown in Figure 4.2.2. The graph of a

cumulative probability distribution is called an ogive. In Figure 4.2.2 the graph of F(x)

consists solely of the horizontal lines. The vertical lines only give the graph a connected

appearance. The length of each vertical line represents the same probability as that of the

corresponding line in Figure 4.2.1. For example, the length of the vertical line at X = 3

in Figure 4.2.2 represents the same probability as the length of the line erected at X = 3 in

Figure 4.2.1, or .1313 on the vertical scale.
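Successive addition of the probabilities can be sketched in Python with a running sum. The probabilities are the rounded values of Table 4.2.2; the names `xs`, `probs`, and `F` are our own.

```python
from itertools import accumulate

# Values of X and their probabilities, as rounded in Table 4.2.2.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
probs = [0.2088, 0.1582, 0.1313, 0.1313, 0.1953, 0.1246, 0.0135, 0.0370]

# F[x] = P(X <= x): running sum of the probabilities up to and including x.
F = dict(zip(xs, accumulate(probs)))

print("x   P(X <= x)")
for x in xs:
    print(f"{x}   {F[x]:.4f}")
```

Each step of the running sum adds exactly one probability from Table 4.2.2, which is why each jump in the cumulative graph has the same size as the corresponding bar in Figure 4.2.1.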

TABLE 4.2.3 Cumulative Probability Distribution of

Number of Programs Utilized by Families Among the

Subjects Described in Example 4.2.1

Number of Programs (x) Cumulative Frequency $P(X \le x)$

1 .2088

2 .3670

3 .4983

4 .6296

5 .8249

6 .9495

7 .9630

8 1.0000

[Figure: step graph. Vertical axis: F(x) (0.0 to 1.0). Horizontal axis: x (number of programs), 1 to 8.]

FIGURE 4.2.2 Cumulative probability distribution of number of assistance programs

among the subjects described in Example 4.2.1.


By consulting the cumulative probability distribution we may answer quickly

questions like those in the following examples.

EXAMPLE 4.2.4

What is the probability that a family picked at random used two or fewer assistance

programs?

Solution: The probability we seek may be found directly in Table 4.2.3 by reading the

cumulative probability opposite x = 2, and we see that it is .3670. That is,

$P(X \le 2) = .3670$. We also may find the answer by inspecting Figure 4.2.2

and determining the height of the graph (as measured on the vertical axis)

above the value X = 2. ■

EXAMPLE 4.2.5

What is the probability that a randomly selected family used fewer than four programs?

Solution: Since a family that used fewer than four programs used either one, two, or

three programs, the answer is the cumulative probability for 3. That is,

$P(X < 4) = P(X \le 3) = .4983$. ■

EXAMPLE 4.2.6

What is the probability that a randomly selected family used five or more programs?

Solution: To find the answer we make use of the concept of complementary probabilities.

The set of families that used five or more programs is the complement of

the set of families that used fewer than five (that is, four or fewer) programs.

The sum of the two probabilities associated with these sets is equal to 1. We

write this relationship in probability notation as $P(X \ge 5) + P(X \le 4) = 1$.

Therefore, $P(X \ge 5) = 1 - P(X \le 4) = 1 - .6296 = .3704$. ■

EXAMPLE 4.2.7

What is the probability that a randomly selected family used between three and five

programs, inclusive?

Solution: $P(X \le 5) = .8249$ is the probability that a family used between one and five programs, inclusive. To get the probability of between three and five programs, we subtract, from .8249, the probability of two or fewer. Using probability notation we write the answer as $P(3 \le X \le 5) = P(X \le 5) - P(X \le 2) = .8249 - .3670 = .4579$.
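The lookups in Examples 4.2.4 through 4.2.7 can be sketched in a few lines of Python. This is a minimal illustration rather than part of the original text: the dictionary `cdf` holds only the cumulative probabilities quoted in the examples (for x = 2, 3, 4, and 5), and all variable names are assumptions chosen for readability.

```python
# Cumulative probabilities P(X <= x) as quoted in Examples 4.2.4-4.2.7.
# Only values cited in the text are included; the name `cdf` is illustrative.
cdf = {2: 0.3670, 3: 0.4983, 4: 0.6296, 5: 0.8249}

# Example 4.2.4: two or fewer programs, P(X <= 2)
p_two_or_fewer = cdf[2]

# Example 4.2.5: fewer than four programs, P(X < 4) = P(X <= 3)
p_fewer_than_four = cdf[3]

# Example 4.2.6: five or more programs, P(X >= 5) = 1 - P(X <= 4)
p_five_or_more = 1 - cdf[4]

# Example 4.2.7: three to five programs inclusive, P(X <= 5) - P(X <= 2)
p_three_to_five = cdf[5] - cdf[2]

print(f"{p_two_or_fewer:.4f} {p_fewer_than_four:.4f} "
      f"{p_five_or_more:.4f} {p_three_to_five:.4f}")
```

Rounded to four decimal places, the printed values should agree with the answers obtained in the four examples.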

The probability distribution given in Table 4.2.1 was developed out of actual experience, so

to find another variable following this distribution would be coincidental. The probability

distributions of many variables of interest, however, can be determined or assumed on the

basis of theoretical considerations. In later sections, we study in detail three of these

theoretical probability distributions: the binomial, the Poisson, and the normal.

Mean and Variance of Discrete Probability Distributions

The mean and variance of a discrete probability distribution can easily be found using the formulae below.

\[
\mu = \sum x\,p(x) \qquad (4.2.1)
\]

\[
\sigma^2 = \sum (x - \mu)^2\,p(x) = \sum x^2\,p(x) - \mu^2 \qquad (4.2.2)
\]

where p(x) is the relative frequency of a given value x of the random variable X. The standard deviation is simply the positive square root of the variance.
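The text states the two forms of the variance in Equation 4.2.2 without showing why they are equal. A short sketch of the step, writing $\mu$ for the mean and $\sigma^2$ for the variance and assuming only that the probabilities sum to 1 and that $\mu = \sum x\,p(x)$ as in Equation 4.2.1, is:

\[
\sum (x - \mu)^2\,p(x)
= \sum x^2\,p(x) - 2\mu \sum x\,p(x) + \mu^2 \sum p(x)
= \sum x^2\,p(x) - 2\mu^2 + \mu^2
= \sum x^2\,p(x) - \mu^2 .
\]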

EXAMPLE 4.2.8

What are the mean, variance, and standard deviation of the distribution from Example 4.2.1?

Solution:

\[
\mu = (1)(.2088) + (2)(.1582) + (3)(.1313) + \cdots + (8)(.0370) = 3.5589
\]

\[
\sigma^2 = (1 - 3.5589)^2(.2088) + (2 - 3.5589)^2(.1582) + (3 - 3.5589)^2(.1313) + \cdots + (8 - 3.5589)^2(.0370) = 3.8559
\]

We therefore can conclude that the mean number of programs utilized was 3.5589 with a variance of 3.8559. The standard deviation is therefore $\sqrt{3.8559} = 1.9637$ programs.
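The mean, variance, and standard deviation formulas of this section translate directly into Python. The sketch below is illustrative only. The probabilities for x = 1, 2, 3, and 8 are those shown in the solution above, and the values for x = 4 and 5 are differences of the cumulative probabilities quoted in Examples 4.2.5 through 4.2.7. The values for x = 6 and 7 (.1246 and .0135) are assumptions, chosen to be consistent with the stated mean, because Table 4.2.1 is not reproduced here. The function name `mean_var_sd` is hypothetical.

```python
import math

def mean_var_sd(dist):
    """Return (mean, variance, sd) of a discrete distribution {x: p(x)}."""
    mu = sum(x * p for x, p in dist.items())               # Equation 4.2.1
    var = sum((x - mu) ** 2 * p for x, p in dist.items())  # Equation 4.2.2
    return mu, var, math.sqrt(var)

# p(x) for x = 1..8; the entries for x = 6 and 7 are assumed values.
programs = {1: 0.2088, 2: 0.1582, 3: 0.1313, 4: 0.1313,
            5: 0.1953, 6: 0.1246, 7: 0.0135, 8: 0.0370}

mu, var, sd = mean_var_sd(programs)

# Shortcut form of Equation 4.2.2: sum of x^2 p(x), minus mu^2.
var_shortcut = sum(x ** 2 * p for x, p in programs.items()) - mu ** 2

print(f"mean={mu:.4f} variance={var:.4f} sd={sd:.4f}")
```

Under these assumptions the results should match the worked solution to within rounding, and the definitional and shortcut forms of the variance should agree.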

EXERCISES

4.2.1. In a study by Cross et al. (A-2), patients who were involved in problem gambling treatment were

asked about co-occurring drug and alcohol addictions. Let the discrete random variable X represent

the number of co-occurring addictive substances used by the subjects. Table 4.2.4 summarizes the

frequency distribution for this random variable.

(a) Construct a table of the relative frequency and the cumulative frequency for this discrete

distribution.

(b) Construct a graph of the probability distribution and a graph representing the cumulative

probability distribution for these data.

4.2.2. Refer to Exercise 4.2.1.

(a) What is the probability that an individual selected at random used five addictive substances?

(b) What is the probability that an individual selected at random used fewer than three addictive

substances?

(c) What is the probability that an individual selected at random used more than six addictive

substances?

(d) What is the probability that an individual selected at random used between two and five addictive

substances, inclusive?

4.2.3. Refer to Exercise 4.2.1. Find the mean, variance, and standard deviation of this frequency distribution.

4.3 THE BINOMIAL DISTRIBUTION

The binomial distribution is one of the most widely encountered probability distributions in

applied statistics. The distribution is derived from a process known as a Bernoulli trial,

named in honor of the Swiss mathematician James Bernoulli (1654–1705), who made

significant contributions in the field of probability, including, in particular, the binomial

distribution. When a random process or experiment, called a trial,