# Apologetics

## The challenge of reconciling our mental models with the material physical universe: Top-down and bottom-up approaches

A recurring theme that you'll find at POEM is how the practice of science is defined, in large measure, by its central value of seeking to avoid bias and by a collection of methods designed to assist scientists in avoiding bias when interpreting research results.

Even more than other methods for avoiding cognitive and logical traps, statistical measures are some of the most rigorous tools scientists have for providing clear frameworks for interpreting what the data from empirical observations and experiments actually mean.

To lay the foundation for discussing statistics in evaluating massage research, let's first talk about different approaches to the challenge of reconciling our mental models with the material physical world.

#### Data, information, facts, and truth

Data is a collection of factual information used as the basis for reasoning, discussion, or calculation. When a scientist talks about a fact that is rooted in research, they are referring to a piece of information that is being presented as objective reality.

Because that information is a fact, a scientist will often say "It is true that..." and then go on to state whatever that particular fact means.

It is easy for a casual listener to believe the scientist must be referring to absolute “Truth”, because of the way these words are commonly used in everyday conversation.

For example, the media may cover scientific topics in a way that implies that science points directly to “Truth” in the same sense the term is used in philosophy, meaning-making, or self-expression.

But this is not a faithful representation, because science—which deals only with aspects of the natural material physical universe—takes for granted that the measurement of things observed in the natural world contains a certain amount of error. By "error", we mean the Merriam-Webster dictionary meaning of "a variation in measurement, calculation, or observation due to mistakes or uncontrollable factors".

As we will discuss in Chapter 4 of the research literacy e-Book, it is impossible to observe or measure reality from a completely 100% neutral position, and there are no perfect measurement tools.

For this reason, scientists emphasize working in a way to obtain the best results possible, knowing that no observations of reality can be completely error-free.

Absolute truth can never be achieved; the best we can do, if the process is carried out with integrity, is to get closer and closer to what the facts are.

In order to work toward this goal, scientists have developed methods for managing observational errors, because those errors can be understood and controlled by making skillful choices about experimental design and statistical techniques.

The Semantic Triangle, introduced in Chapter 2 of the research literacy e-Book here at POEM and available later this month, shows how the elements of meaning can be divided among concepts (the meanings people attach to ideas), terms (the language used to describe ideas), and referents (the things in the natural world to which terms and concepts refer).

Source: http://sig.biostr.washington.edu/~raven/semantic-triangle.jpg accessed 2 May 2012

The big question is how to know—given that perceptions and experience vary so much from one person to another—that those concepts and terms in our minds really connect to the referents they claim to represent.

Sorting out how best to connect those internal aspects of meaning to the external physical world is an ongoing problem that challenges all of us.

#### Top-down vs. bottom-up approaches to data

One approach that has been taken throughout history is to decide in advance what the “truth” is, and then to look for empirically observed facts that will reinforce that “truth.”

This is known as the top-down approach, in which a researcher starts with a desired answer in mind and then fits the questions and the data into that answer.

Obviously, this approach implies a great deal of bias from the start.

Ptolemy, a Greek astronomer who lived in Egypt during the first and second centuries CE/AD, developed a model showing the sun and the planets in a circular orbit around the Earth. This model depicted the Earth at the center of everything, or geocentrism: a view that seemed at first to fit with what people observed when they looked up at the sky.

Source: http://upload.wikimedia.org/wikipedia/commons/7/7b/Bartolomeu_Velho_1568.jpg accessed 1 May 2012

But some careful observers noted that a planet such as Mars would sometimes be seen moving in its normal direction, but then it would come to a stop and begin to move in the opposite direction—backward across the sky—before returning to its expected path. It seemed to move in a retrograde way.

The left side of the drawing shows the Earth's actual motion around the sun in the blue points 1-5. Mars' actual motion around the sun is shown by the red points on the left of the diagram, and the right side of the diagram shows what Mars' motion looks like to an observer on the Earth. So there is no such thing as Mars (or Mercury, for that matter) in retrograde; it's actually an illusion produced by our motion relative to the other planet around the sun.

To reconcile this observation with the idea of the planets and sun making simple circles around the Earth, advocates for Ptolemaic astronomy used the concept of epicycles, or loops, that represented the additional movements of the planets. Epicycles were explained as looping paths that averaged out to simple circles. In the expanded Ptolemaic system, the planets and sun were continually looping around given points, which were themselves moving in simple perfect circles around the earth.

Source: http://upload.wikimedia.org/wikipedia/commons/2/29/Ptolemaic_elements.svg accessed 1 May 2012.

As in the previous image, Mars is shown in red, and Earth in blue. This is the model of epicycles introduced to account for what looked to observers on Earth to be retrograde motion.

Because of the observed referent (occasional apparent or seeming reversals in movement of the planets), it was necessary to add this new term and concept (epicycles) in order to hold onto and protect the Ptolemaic idea that something was moving in perfect circles around the earth. The advocates of Ptolemaic astronomy kept adding epicycles as necessary to force the model to fit the observations.

And for a very long time, despite the hacks and cobbled-together epicycle justifications, the Ptolemaic model continued to have a great influence on astronomy’s view of the Earth’s place in the universe, because there was not much change in the data available to observers.

But over time, new observational instruments such as telescopes were invented, and these made it possible to add new information to the accumulated body of knowledge about the sky.

Eventually, a tipping point was reached, and the weight of evidence made it clear that Ptolemy’s model of the universe no longer matched the observed facts.

A newer explanation, called the heliocentric model, was developed by Polish astronomer Nicolaus Copernicus (1473-1543) in which all the planets, including Earth, orbited around the sun.

Source: http://upload.wikimedia.org/wikipedia/commons/5/57/Heliocentric.jpg accessed 1 May 2012

Source: http://upload.wikimedia.org/wikipedia/commons/3/33/Geoz_wb_en.svg accessed 1 May 2012

A century later, Johannes Kepler introduced his laws of planetary motion, which demonstrated that the planets actually move in elliptical paths around the sun, not in perfect circles--a model which was an even better fit to the empirical data.

Those who insisted on retaining Ptolemy’s view of the universe, despite the growing evidence against it, were holding on to the top-down approach to data. They practiced apologetics, and used cherry-picking, special pleading, and other fallacious techniques, to protect their model from the challenge the material physical world confronted it with.

In contrast, the bottom-up approach of Copernicus and Kepler, who worked from the data to develop their conclusions, won out.

These new thinkers prevailed over the Ptolemaists because they were willing to let go of their previous beliefs (Kepler, in particular, was disappointed by the idea that planets moved in ellipses rather than in the perfect circular shapes he found so beautiful, but he followed his conscience in following the process where it led) and to let the data itself tell the story.

[Of course, by "prevailed", we never mean "100% accepted": there are, after all, modern-day adherents to the Flat Earth model in the incarnation of the Flat Earth society, just to name one example (Motto: "Replace the science religion...with SANITY.").

What we mean is that the majority of professionals, who have actually done the work to understand the domain, vouch for the work as having been carried out with integrity, and to be validated as showing the results it claims to demonstrate.]

Statistics is one methodology that we apply in a bottom-up approach to understand the meaning of the story that the data is telling us.

##### Exercise

Can you think of some real-life examples of where people try, or have tried, to protect an old model that has been discredited, despite the mounting evidence against it?

Areas where you might find examples nowadays include healthcare and politics, among others.

How far are some people prepared to go to protect old models?

What techniques do they use to do so?

What are the stakes--politically, psychologically, economically, and in other domains?

## Chapter 5: Just enough statistics

01. Why you might want to know this

This chapter provides a high-level overview of the basic concepts and vocabulary associated with statistics, the branch of mathematics that deals with the collection, analysis, interpretation, and presentation of numerical data. We will also cover some of the most important statistical measures encountered in massage research literature.

While statistics may initially seem somewhat intimidating, a few simple and useful concepts will go far in helping you to develop massage research literacy.

02. Industry-level massage educational and performance objectives addressed by this chapter

03. Learning objectives for this chapter

03a. Upon successful completion of this chapter, you will be able to do, know and understand, and value the following:

03b. Do

• Name and explain the most common and most important statistical measures used in articles on massage research.
• Using a sample massage research journal article, recognize and point out the mean in the statistical results.
• Using the lines on a boxplot, name the descriptive statistical values they represent.
• Recognize and point out a usage of standard deviation in a sample massage research article and explain what it means in context.

03c. Know and understand

• Name and describe the two primary ways in which statistical measures are used.
• Name and explain the most common and most important statistical measures used in articles on massage research.
• Explain the differences between a top-down approach to exploring knowledge and a bottom-up approach.
• Define the concept of a normal distribution, represented by a bell curve.
• Define the terms data range, average, and percentile.
• Define the terms variance and standard deviation.
• Name and explain three ways of measuring the average of a group of data values.
• Explain how a boxplot is used to represent significant values of data.

03d. Appreciate

• Discuss how scientists attempt to manage error and uncertainty through the use of specialized tools and techniques.
• Discuss how making good choices about experimental research design and appropriate statistical measures can compensate for the effects of experimental error.

04. Big ideas in this chapter

• Statistical measures provide frameworks and guidelines for researchers in their quest to minimize the effects of error in interpreting research results.
• Observation and measurement in the natural world inevitably contain a certain amount of error because no person can be absolutely neutral, and no measuring tools can measure perfectly.
• A top-down approach to exploring knowledge attempts to fit new information into previous beliefs or hypotheses.
• A bottom-up approach lets the explanation emerge from the observed data itself.
• To understand the meaning of the story being told by the data, researchers need ways to distinguish real patterns and trends from things that only seem to be patterns or trends.
• The concept of a normal distribution, represented by a bell curve, is one of the most useful and powerful statistical concepts and lies at the foundation of a great number of statistical techniques and measures.
• The concept of "normal" is closely tied to the concept of "average". "Average" refers to three statistical measures that describe data: the mean, the median, and the mode.
• The arithmetic mean, often referred to simply as mean, is calculated by adding all the values of the data together and dividing by the number of members in the group. If all the data values are fairly similar, the mean can be a good representation of the group, but if there are extremely high or low values in the data, the mean is not very representative of the group.
• The median represents the exact middle of a group of data; 50% of the data values fall above the median, and 50% of the data values fall below.
• The mode is the value that occurs most often in the data. Not all data sets have a mode, and some data sets have two or more modes.
• A percentile value is a threshold that indicates what percentage of the values fall below it. For example, 20% of values fall below the 20th percentile and 60% of values fall below the 60th percentile.
• A boxplot is a graphic device used to illustrate significant values of the data at a glance. Lines on and around the box correspond roughly to the values in the bell curve.
• The range of data includes all the values, from the lowest to the highest.
• Variance is a measure that describes how spread out, or dispersed, the data values are from one another.
• The most common representation of data dispersion is the standard deviation (SD). The SD indicates how representative the mean is of a set of data values.

05. Key terms in this chapter

• arithmetic mean
• average
• bell curve
• bottom-up approach
• box and whiskers diagram
• boxplot
• confidence interval
• data
• descriptive statistics
• error
• error bars
• facts
• false negative error (type II error)
• false positive error (type I error)
• inferential statistics
• mean
• median
• mode
• normal distribution
• normal values
• p
• percentile
• population
• power and sample size
• range
• range of normal values
• sampling
• standard deviation
• statistics
• top-down approach
• variance
• weight of evidence
• α (alpha)
• β (beta)
• κ (kappa)


06. Claims made in this chapter

07. Entities and relationships for Ontology of Meaning in Massage in this chapter

08. Exercises

09. References cited in this chapter

1. Barlow A, Clarke R, Johnson N, Seabourne B, Thomas D, Gal J. Effect of massage of the hamstring muscle group on performance of the sit and reach test. Br J Sports Med. 2004 Jun;38(3):349-51.
2. Sakurai M, Suleman MI, Morioka N, Akça O, Sessler DI. Minute sphere acupressure does not reduce postoperative pain or morphine consumption. Anesth Analg. 2003 Feb;96(2):493-7.
3. Kshettry VR, Carole LF, Henly SJ, Sendelbach S, Kummer B. Complementary alternative medical therapies for heart surgery patients: feasibility, safety, and impact. Ann Thorac Surg. 2006 Jan;81(1):201-5.
4. Manikandan N. Effect of facial neuromuscular re-education on facial symmetry in patients with Bell's palsy: a randomized controlled trial. Clin Rehabil. 2007 Apr;21(4):338-43.
5. Sankaranarayanan K, Mondkar JA, Chauhan MM, Mascarenhas BM, Mainkar AR, Salvi RY. Oil massage in neonates: an open randomized controlled study of coconut versus mineral oil. Indian Pediatr. 2005 Sep;42(9):877-84.

10. Other learning resources

11. Introduction

The previous chapter, and this one, are probably the hardest chapters in the book. Once you get past these, it's relatively smooth sailing from here on out.

This one, especially, is rather math-y. But remember: there is nothing here that you can’t understand, if it is properly explained and you do the work of integrating that knowledge. I suspect that if we go through this together, you will find that a lot of things that look rather hard at first glance will begin to fall into place as we examine them. And I promise that if you “bear” with me, there’s something good waiting for you at the end :).

For our purposes, we are not going to go into a lot of rigorous detail about statistics. We will, however, review some of the most common terms, so that you understand them when we encounter them. For the purpose of reading massage research articles, recognizing a few important ones--and understanding what they mean and why they are used--will take you very far.

The statistical understanding needed for reading research should not be confused with what you need to carry out research. If you do get to the point where you actually carry out studies (something I wish to encourage), you will need to know much more than I am presenting here; you would even need to consult a specialist beforehand to plan what statistics are appropriate for your study. So the take-home message is: for reading purposes, know the terms and concepts discussed here. But if you want to take it further, be aware that you will also have to further develop your understanding of statistics.

The obvious first question is: Why do we need to learn statistics at all?

Part of the answer is that we use statistics in order to ensure that our results aren’t just due to chance, but rather are the result of what we hypothesized caused them.

Additionally, we use statistics in research to avoid some of the cognitive errors that we talked about in Chapter 4. Without analyzing the research results statistically, a result can often mislead us.

Statistics is also one way of describing, presenting, publishing, and sharing the results of a study. And finally, we use statistics to better understand how to apply research results in our practice of massage.

So those are some of the reasons why statistics are important. But remember: don’t get bogged down in too many details—even the specialists argue over them.

Our strategy is to get familiar with most common and most important stats, and to just skim the others. If you know in general what the most common statistical measures are, and what they mean, you will be in a good position to understand the articles that use them. Some other important ones will be worth learning about, if you continue studying how to carry out research, but we won’t deal with them here, and we'll point them out if we come across them.

#### The Role of Statistics

The bottom-up approach of following the data wherever it may lead lies at the heart of why the scientific method has been so successful.

However, this strength comes with an associated challenge: how to distinguish real patterns in the data (that lead to answers) from false patterns (that indicate some kind of error).

This is a difficult task, and statistics is a key tool that has been developed and refined to provide validated guidelines for making judgments about accumulated knowledge.

Statistics aids the interpretation of meaning in terms of how things can vary from one another, how to lower errors in observation, how to know when two things are associated or in a cause-and-effect relationship, and how to classify things in meaningful ways.

In these ways, statistics lowers—though never totally eliminates—the risk of making errors in interpreting data.

There are two primary ways in which statistics are used, and different statistical measures are used for each of those purposes.

Descriptive statistics, as their name implies, are used to describe characteristics of data about a group. The data can be anything measured systematically (e.g., characteristics of mothers receiving a pregnancy massage treatment or heart-rate measurements taken over a period of several days among athletes receiving sports massage).

Inferential statistics are used to infer, or make predictions about, trends or patterns that are contained in the data and are used to distinguish real patterns from things that only seem to be patterns.

Inferential statistics are largely beyond the scope of this chapter, which is intended to cover the most important concepts needed to begin thinking in statistical terms; we will, however, touch on a few key inferential ideas toward the end of the chapter.

#### Average: Mean, median, and mode

An average is an attempt to describe qualities of a group by combining qualities of individual members of the group.

When we use it in conversation, the term "average" can be somewhat imprecise. In statistics, however, we need the meaning of average to be more precise, and we accomplish this goal by using several statistical measures that describe data about a group, or population. These measures include the mean, the median, and the mode, which tell us some different things about the distribution of those individual qualities or values.

These measures represent three different approaches to averaging, each of which is useful in different situations, and each has its own strengths and weaknesses.

9.1 Mean

The arithmetic mean is the type of mean you will encounter most often in reading massage research literature. You're very likely to be familiar with it already, since it's commonly used to calculate grades. The arithmetic mean is calculated by adding all the values of the data points together and dividing that total by the number of data points.


Figure 9.1: Mean (average) value of 5 final exam grades.
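The calculation can be sketched in a few lines of Python. The five grades below are hypothetical stand-ins (the actual values shown in Figure 9.1 are not reproduced here):

```python
from statistics import mean

# Five hypothetical final exam grades (stand-ins for the values in Figure 9.1).
grades = [88, 92, 75, 95, 80]

# Arithmetic mean: sum of all values divided by the number of values.
average = sum(grades) / len(grades)
print(average)                  # 86.0

# The standard library computes the same thing directly.
assert average == mean(grades)
```

Either form gives the same result; the manual version simply makes the "add them up, then divide" definition explicit.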

The term "mean" by itself is frequently used in the literature as shorthand for arithmetic mean. If you ever do encounter another, less common measurement of mean used in a study, the specific type, such as geometric mean or harmonic mean, will be explicitly noted (and if you do, I'd love to hear about it, since I've never seen them used in massage research).

The mean is often represented in the literature by the letters m̄, m, M, x̄, x, or X. For example, the phrase in parentheses in the following statement from a massage research article

Forty-eight children (M age = 4.8 years) infected with HIV/AIDS and living in the Dominican Republic were randomly assigned to a massage therapy or a play session control group.

indicates that the 48 children in the study were, on average (mean), 4.8 years old.

Figure 4-5 shows data excerpted from a study (Barlow 2004) that investigated whether a single massage treatment would alter the flexibility of the hamstring muscles in physically active young men, as measured by the mean (average) value on the sit-and-reach test.1

Source: School of Applied Sciences, University of Glamorgan, Pontypridd, Wales, UK.

Abstract

OBJECTIVE: To investigate if a single massage of the hamstring muscle group would alter the performance of the sit and reach test.

METHODS: Before treatment, each of 11 male subjects performed the sit and reach test. The treatment consisted of either massage of the hamstring muscle group (both legs, total time about 15 minutes) or supine rest with no massage. Performance of the sit and reach test was repeated after treatment. Each subject returned the subsequent week to perform the tests again, receiving the alternative treatment relative to their initial visit. Mean percentage changes in sit and reach scores after treatment were calculated for the massage and no massage treatments, and analysed using Student's t tests.

RESULTS: Mean (SD) percentage changes in sit and reach scores after massage and no massage were small (6.0 (4.3)% and 4.6 (4.8)% respectively) and not significantly different for subjects with relatively high (15 cm and above) values before treatment. Mean percentage changes in sit and reach scores for subjects with relatively low values before treatment (below 15 cm) were large (18.2 (8.2)% and 15.5 (16.2)% respectively), but no significant differences were found between the massage and no massage groups.

CONCLUSIONS: A single massage of the hamstring muscle group was not associated with any significant increase in sit and reach performance immediately after treatment in physically active young men.

Barlow included his data in Table 1, so we can calculate the mean of all the sit-and-reach scores for the subjects (1) before and (2) after the massage by adding all the values in the appropriate column and then dividing by 11 (the number of subjects in the study):

Barlow's Table 1

Averages for Barlow's data

The data show that the mean (average) sit-and-reach score for the subjects before massage treatment was 16.64 cm; after one massage treatment, the mean score increased to 18.55 cm.

The disadvantage of the mean is that if any of the data being averaged is extremely high or extremely low, the mean can be so different from the rest of the data that it does not give an accurate description, especially when there are only a few data points. An easy way to illustrate this problem is to imagine a millionaire and a homeless person as the only two people standing in line at the post office on a given day. The millionaire's net worth ($1,000,000) added to the net worth of the homeless person ($0) is $1,000,000; dividing that total by 2 (the number of subjects) gives $500,000. So it could be said that the mean net worth of everyone in line at the post office at that time was $500,000. While this is a true statement in mathematical terms, it does not accurately describe the financial situation of either person in line that day, dramatically demonstrating how the mean fails when confronted with populations that are not distributed in a normal (bell-curve) manner.

Another disadvantage of the mean is that it can't tell you about extreme values in the data, or how any individual compares to the group, except in the most crudely approximate way. In order to examine this limitation further, let's set up our own table including the mean score (shown in the callouts in Table 1 on the previous page). In the last column, observe the difference between each subject's score and the mean score.
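A quick sketch in Python makes the outlier problem concrete. The first two net-worth figures come from the post-office example above; the three additional "ordinary" values are hypothetical, added to show how the median resists the outlier:

```python
from statistics import mean, median

# The two people from the post-office example.
net_worths = [1_000_000, 0]
print(mean(net_worths))     # 500000.0 — describes neither person well

# Add three hypothetical ordinary people to the line.
line = [1_000_000, 0, 30_000, 45_000, 52_000]
print(mean(line))           # 225400.0 — still pulled far upward by the millionaire
print(median(line))         # 45000 — much closer to a "typical" person in line
```

This is why researchers often report the median instead of (or alongside) the mean when a data set contains extreme values.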

9.2 Median

The median is representative of the data in a different way from the mean—it is the value above which half the data falls, and below which the other half of the data falls.

Exercise:

Let’s find the median score for the before massage group in Barlow’s sit-and-reach study. First, we’ll rearrange our table in descending order, so that the sit-and-reach scores go from highest to lowest.

You'll noticee that sometimes in the real world, things are not really cut-and dried—with this data there are 4 scores above the medium score, and 5 scores above the Medium score. The scores for subject 10 and for subject 6 were as close to a middle score as we could get. Therefore the median value is roughly in the middle.

Figure 9.2: Sit-and-reach scores in centimeters (Barlow 2004).

Note that here there are exactly 4 scores above and 4 scores below the score of 17, which is the score for subjects 7, 6, and 2. So in this case the median value is exactly in the middle.
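Here is a minimal Python sketch of the median calculation. The eleven scores below are hypothetical values chosen only to mimic the shape of the data discussed above (Barlow's actual Table 1 values are not reproduced here):

```python
from statistics import median

def my_median(values):
    """The value above which half the data falls and below which the other half falls."""
    s = sorted(values)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:                      # odd count: the exact middle value
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2    # even count: mean of the two middle values

# Eleven hypothetical sit-and-reach scores in cm.
scores = [11, 13, 15, 16, 17, 17, 17, 19, 21, 24, 26]
print(my_median(scores))                # 17
assert my_median(scores) == median(scores)
```

With an odd number of subjects (like the 11 here), the median is simply the middle value of the sorted list; with an even number, it is the mean of the two middle values.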

Here is another example of a researcher reporting median values. For now, just focus on the median figures; we'll get to percentiles in a little bit.

* Data are reported as median (25th percentile, 75th percentile). Fifty-three patients (30 controls and 23 minute spheres) completed the study.

** Morphine requirements (47 mg [27, 58] vs ***41 mg [25, 69]) and pain scores (29.5 mm [16, 59] vs 40 mm [22, 58]) were similar in the control and acupressure groups. (Sakurai 2003)

9.3 Percentile

If a value is in the 99th percentile, that means that 99% of the values are lower. For a value in the 60th percentile, 60% of the values are lower; for a value in the 30th percentile, 30% of the values are lower, and so forth.
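The definition above can be sketched directly in Python. Real statistics packages offer several interpolation variants for percentiles; this hypothetical `percentile_rank` helper uses only the simplest "percentage of values below" definition from the text:

```python
# Percentile rank: the percentage of values in a data set that fall below x.
# (A minimal sketch; statistical packages use more refined interpolation rules.)
def percentile_rank(values, x):
    below = sum(1 for v in values if v < x)
    return 100 * below / len(values)

data = list(range(1, 101))              # the values 1 through 100
print(percentile_rank(data, 100))       # 99.0 — 99% of the values fall below 100
print(percentile_rank(data, 61))        # 60.0 — 60% of the values fall below 61
```

So in the Sakurai excerpt, the 25th and 75th percentile values in brackets tell you the range within which the middle half of the patients fell.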

Now let's revisit the excerpt from Sakurai 2003 on the previous page, this time focusing on the percentile values.

Figure 9.3: text.

Figure 9.4: text.

*Data are reported as median (25th percentile, 75th percentile). Fifty-three patients (30 controls and 23 minute spheres) completed the study. **Morphine requirements (47 mg [27, 58] vs ***41 mg [25, 69]) and pain scores (29.5 mm [16, 59] vs 40 mm [22, 58]) were similar in the control and acupressure groups. (Sakurai 2003)

Interpreting what this says is not as difficult as it first looks. The figures below illustrate how this shorthand is translated.

* “Data are reported as median (25th percentile, 75th percentile).”


Figure 9.5: text.

Figure 9.6: text.

**“Morphine requirements (47 mg [27, 58] vs *** 41 mg [25, 69])”

9.4 Mode

The mode represents the data in a different way than either the mean or the median. The mode is the value which occurs most frequently in the data. In other words, it is the value “most typical” of the population.

A data set can have no mode (if all the values are unique—although some authors consider this to mean that every value is a mode), one mode (if one particular value is the most frequently occurring value), or more than one mode (if multiple values are tied for the most frequently occurring value).

Figure 9.7: text.

This data set has two modes, 14 and 13.

This data set has one mode, 17.

You won't come across the mode too often, as its meaningfulness in research applies more narrowly than the mean and percentile. One situation where the mode is meaningful is with categorical data: in a hypothetical survey asking clients which of several massage modalities they prefer, for example, the mode would identify the most popular choice, while a mean of the categories would be meaningless.

The statistics we have covered up until now are useful, but in order to get a clearer picture of what all the data looks like, there are more refined tools we can use to understand the relationships among the values. Standard deviation (SD) is one of those tools.

9.5 Preparing to discuss standard deviation

This (standard deviation) is probably the hardest concept we are going to cover. But it is worth it, because of the value of the concept and its applicability to so many different situations. So let’s break this up into small pieces to tackle one piece at a time, and see how we can use it, not only in reading massage research, but in many other situations as well.

I remember when a sad event in my childhood brought home to me the concept of a population, although I certainly didn’t think about it that way at the time.

Figure 9.8: text.

When I was in fifth grade, a child at my school died. Although I didn’t know the child personally, I was sad to hear the news, as was everyone else there. Then I started putting it together with what had happened the year before, when another child had died. I figured out that there must be some kind of rule that every year one child dies at our school, and that next year it could be me. That particular thought was scary enough to keep me awake for a couple of nights.

Although I was kind of on the right track in certain ways, there were some flaws in my analysis; however, as I was 10 years old at the time, I think I can be forgiven for a certain lack of mathematical rigor. The observation that there was a pattern—the death of one child per year—was reasonable for that very short time span, although if I had been paying attention longer, it is possible that there would have been many other years where no child at that school died.

But from that observation of a pattern, I went a little too far in imagining a “rule” that one child died every year—it would be better to think of it as a description of what did happen, rather than as a prescription for what must happen. If you think of it in that way, you can see one function that statistics serves—descriptive statistics summarizes the data about a population or a study, and describes in what way they are similar (central tendency) or different (variability). It takes a very diverse group, and tries to convey concisely and efficiently to the audience what the important measures of that group are. The statistical measures we have gone over up until now—mean, median, mode, and percentile—are descriptive statistics.


Inferential statistics takes things a step further—it lets us use reasoning to infer, or make predictions, about the group, based on what we already know. It’s what I was dimly sensing when I realized that another child could die at my school the next year1, and so came up with my “rule”. The statistics we are going to talk about now are inferential statistics, and understanding the concepts of normal distribution, standard deviation, types of error, sample size and power, and inter-observer agreement will make a great deal—even most—of the massage research literature accessible to you.

1I was, and still am, quite happy to have been proved wrong on that prediction.

9.6 Standard deviation

Standard deviation has a lot in common with the averages we discussed earlier, and we will talk about how we can use it as a kind of descriptive statistic. To understand standard deviation, however, we first have to all be on the same page about what normal distribution means, so we’re going to talk about that first, and then come back to standard deviation.

9.6.1 Normal distribution

We talked earlier in Chapter 2 about how “normal” is one of those words that has a specific, neutral meaning in science, yet has very strong connotations in everyday language. It’s unfortunate that this word is so heavily loaded, as it is one of the most useful and powerful statistical concepts there is, and serves as a gateway to the world of inferential statistics. The word has been used as a weapon to enforce social and medical agendas—after I have taught a session on massage research and fibromyalgia, I’ve had people come up to me afterwards and tell me how painful it is to be told they are not “normal”, where “normal” is a prescriptive word for how they should be. Let’s be very clear that this is not how we’re using the word. Our specific statistical use of the word is defined below.

First of all, think about a situation you’ve been in with a lot of other people—a lot of the time, a few people are extreme in some value one way or the other, but most people are pretty close to average. We’ve all been born, so let’s consider the weight at birth in all healthy babies born in the US as our example situation:

• A few very big babies: 8 1/2 to 9 pounds

• A few very small babies: 6 to 6 1/2 pounds

• Most babies: somewhere around 7 or 8 pounds, more or less—called normal birthweight because it forms a normal distribution.

This is what that normal distribution looks like. The curved line is called a bell curve—a pretty descriptive name, because it is indeed shaped like a bell.


Figure 9-2: Bell curve showing normal distribution of birthweights
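If you’d like to watch a bell shape emerge for yourself, a few lines of Python will do it. The mean and SD below (7.5 and 0.75 pounds) are invented round numbers for illustration, not real US birthweight statistics:

```python
import random

random.seed(2)
# Draw 10,000 simulated birthweights from a normal distribution.
weights = [random.gauss(7.5, 0.75) for _ in range(10_000)]

# A quick text histogram: one row per half-pound bin, one '#' per 100 babies.
for tenth in range(10, 20):  # bins from 5.0 up to 9.5 pounds
    lo = tenth / 2
    count = sum(lo <= w < lo + 0.5 for w in weights)
    print(f"{lo:3.1f}-{lo + 0.5:3.1f} lb | {'#' * (count // 100)}")
```

The tall rows in the middle and the short rows at either end are exactly the “bump” and “tails” of the bell curve described above.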

While all bell curves have the same basic features of a small “tail” at either end (representing a few extreme values) and a large “bump” in the middle (representing a lot of typical values), there can still be some dramatic differences in how the data the bell curve represents is arranged. The following are both bell curves, but look how different they are from each other:

Figure 9-3: Two different bell curves


• The graph on the left is tall and narrow and drops off sharply.

• The graph on the right is shorter and drops off much more gently.

These differences are useful, because they tell us something about the data being studied—namely, about how different the extreme values are from the more typical values for that population. The standard deviation, which is coming up, will explore that distinction in more detail. So now that we are familiar with normal distributions and bell curves, let’s return to standard deviation, and see how that helps us with reading the massage research literature.

9.7 Back to standard deviation

We discussed earlier that the mean can sometimes be a useful way to summarize and describe the data. But when the data under study include extremely high or extremely low values, the mean can end up so different from most of that data that it no longer gives an accurate description. For example, according to the “Bill Gates Net Worth” web page2 just now, at this moment, Bill Gates’ net worth is $27,600,000,000 (give or take). So, if I told you that on average, Bill Gates and I each have a net worth of $13,800,000,000—did you really learn anything relevant and useful about me3? Or did you just get a graphic demonstration of how badly the mean fails when it has to deal with extreme values?

2Yes, there really are some people with that much spare time on their hands. You can find it at: http://bgnw.marcus5.net/bgnw.html if you like.

3If only! :)

Clearly, we need a better tool for describing populations that—like our big, small, and average-sized babies—exhibit a great deal of variation, and the standard deviation (SD) is one of those tools we can use. We won’t bother with the mathematics behind the SD here, because for our purposes, I just want you to be able to recognize it when you come across it in the literature, and to understand what it means.

Sometimes you’ll see the SD called the mean of the mean [ref]—that refers to the way it is computed mathematically, and also to the way it describes data more accurately than just the mean alone does. Assuming a normal distribution of data (our bell curve), the standard deviation describes where in the bell curve the data lies. And so the normal distribution and standard deviation can deal with extreme data as well as more representative data. Further, a large standard deviation can indicate to the reader that there is something wrong with the data, or with the model, or with both.
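A quick way to get a feel for this is to compute the SD of two invented data sets that share the same mean. Python’s standard library can do this directly (a sketch for intuition, not a research-grade analysis):

```python
from statistics import mean, stdev

# Two invented samples with the same mean (7.0) but very different spread.
tight  = [6.9, 7.0, 7.1, 7.0, 7.0]   # values huddled close to the mean
spread = [5.0, 6.0, 7.0, 8.0, 9.0]   # same mean, much more variation

print(mean(tight), round(stdev(tight), 2))    # small SD
print(mean(spread), round(stdev(spread), 2))  # large SD
```

The mean alone makes the two samples look identical; the SD is what tells them apart.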

First of all, let’s put some further meaning on our bell curve. Below we have a bell curve where the different sections are shaded.


Figure 9-4: A bell curve with standard deviations

• The solid gray area, referred to as 1 standard deviation from the mean, represents the largest number of data values. Values that fall in this area of the graph are considered the most normal. We expect about 68% of our values to lie somewhere within this range.

• The striped area, referred to as 2 standard deviations from the mean, represents a smaller number of the values. We expect about 95% of our values to lie within this range (notice that to get from one striped range to the other, we have to go through the gray ranges, so we include that previous 68% in our estimate of 95%).

• The data in the black area, referred to as 3 standard deviations from the mean, represents a small percentage of the values. We expect about 99.7% of our values to lie within this range (notice that to get from one black range to the other, we have to go through the striped ranges and the gray ranges, so we include that 95% in our estimate of about 99.7%).
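Those three percentages (about 68%, 95%, and 99.7%) are a fixed property of the normal distribution, and you can verify them yourself by simulation. Here is a sketch in Python; the mean and SD used are arbitrary, since the percentages come out the same for any normal distribution:

```python
import random

random.seed(0)
mu, sigma = 7.5, 0.75          # arbitrary mean and SD
sample = [random.gauss(mu, sigma) for _ in range(100_000)]

# Count how many simulated values fall within 1, 2, and 3 SD of the mean.
for k in (1, 2, 3):
    inside = sum(mu - k * sigma <= x <= mu + k * sigma for x in sample)
    print(f"within {k} SD: {inside / len(sample):.1%}")
```

The printed percentages land very close to 68%, 95%, and 99.7%, matching the shaded regions of the bell curve.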

So now you can begin to see how this addresses the problem with the mean and the extreme values that we’ve encountered—if we know the mean, and we know how far away (how spread-out) from the mean a particular value of data is, then we have a much more powerful tool for accurately and clearly representing the data than the mean alone is able to provide4.

This is useful because it tells us how “spread-out” the population is. The larger the SD, the more reason you have to be somewhat skeptical of conclusions drawn from the mean alone. Remember our previous two bell curves?

Figure 9-5: Two different bell curves with varying standard deviations

A false positive error (also called a type I error), for our purposes, exists when it looks like the treatment, such as massage, caused an effect when it really didn’t. In other words, its positive result was false. Here is a hypothetical research experiment to see the effect of massage on blood pressure, to illustrate how this can happen.

4So you get a much more accurate description of where I really am if I tell you that on average, Bill Gates and I each have a net worth of $13,800,000,000.00, and that I am more than 3 SD away from that mean. Let’s just leave it at that for now. :)

Figure 9-6

In this experiment, the researchers concluded that massage does indeed lower blood pressure. But suppose the researchers made a change in the experimental design: instead of having the control subjects sit in a chair for one hour, they lay down on the massage table for an hour, which caused a different result for the control group.

Figure 9-7

With this experimental design, lying down on the table for one hour, without being massaged, also lowered blood pressure. The conclusion from this experimental design would be that lying on the table, and not the massage itself, lowers blood pressure. Note that this was not a real experiment and that this may or may not be true. Also note that this is an example of an experiment that any massage therapist can carry out.

A false negative error (also called a type II error) exists when it looks like the treatment, such as massage, had no effect, but it really did. Here is another hypothetical experiment that demonstrates a false negative error. Suppose the 1-hour massage and also just lying on the table for an hour both resulted in no change in blood pressure.

Figure 9-8

However, in this hypothetical experiment, the researcher did not pay attention to a couple of important factors. One was that the massage therapist performing the massages was only available on Monday. And on Monday, there were workmen using jackhammers just outside the window. Also, suppose it is summer and the windows are open. However, when the subjects in the control group came to participate in the study by just lying on the table without getting a massage, it was later in the week and the workmen were gone.
If the researcher is unaware of the jackhammer annoyance factor, he will conclude that massage does not lower blood pressure. However, it is possible that the jackhammer noise was having a blood-pressure-elevating effect which masked the blood-pressure-lowering effect of the massage. Figure 9-9 is a diagram that illustrates this.

Figure 9-9

alpha (α)

The statistical measure α is the probability of making a false positive error. In most of the research literature which you see, the researcher will tend to set α at about 0.05. That 0.05 means a 5% risk of making a false positive error. In the example below, that is what Hopper sets his α at, and he concludes that, with a 5% or smaller probability of seeing an effect that is not really present, his intervention (dynamic soft-tissue mobilization) significantly increased hamstring flexibility in the healthy male subjects he studied.

Example: OBJECTIVES: The purpose of this study was to investigate the effect of dynamic soft tissue mobilisation (STM) on hamstring flexibility in healthy male subjects...The alpha level was set at 0.05. RESULTS: Increase in hamstring flexibility was significantly greater in the dynamic STM group than either the control or classic STM groups with mean (standard deviation) increase in degrees in the HFA measures of 4.7 (4.8), -0.04 (4.8), and 1.3 (3.8), respectively. CONCLUSIONS: Dynamic soft tissue mobilisation (STM) significantly increased hamstring flexibility in healthy male subjects. (Hopper 2005)

beta (β)

Don’t worry too much about β: most massage research studies don’t address it explicitly in their published reports. But you may come across it, and since it is kind of the “mirror-image” of α, I’ll just include it here for your reference more than anything. Just like α is the probability of making a false positive error, β is the probability of making a false negative error. For example, if α is the risk you run of seeing a bear that isn’t there, β is the risk you run of denying a bear that really is there.

9.8 p-value

My biostatistics professor would have an aneurysm (sorry, Dr. L.!) if he saw how we are going to treat the concept of p-value. And in the bigger picture, he would be right—it is a misunderstood and misused statistical measure, and deserves a fuller and richer treatment by experimenters and statisticians. On the other hand, the purpose of this book is to give you enough information to read massage research, not to turn you into a specialist in any given area of experimental design. So our strategy will be to understand p-value enough to use it the way most clinicians do to read research articles, while noting that, in itself, that does not fully do justice to the concept.

9.9 Sampling

9.9.1 Power and sample size

Confidence interval and confidence level

In the political season (and at other times, too), we often see poll results reported as a certain set of results, plus or minus a particular margin of error. Although it’s clear from the context that it means a little uncertainty in the exact results, now that we have discussed the normal distribution, we can understand it in a little more depth. If the poll accurately reflects the population at large, and if we repeated the poll multiple times, we would expect the results to be about the same, with only a little bit of variation. The amount that it can vary—positive or negative, since it can vary either way—is the margin of error. So if Candidate A is preferred by 68% of the population, and Candidate B by 32%, with a margin of error of +/-5%, that means that either candidate’s number could be as much as 5% too high or too low in this poll. So in reality, Candidate A may have anywhere from 63% to 73%, and Candidate B may have anywhere from 27% to 37%.
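What α = 0.05 really means can be demonstrated by simulation: if we run many experiments in which the treatment truly does nothing, and test each at the 0.05 level, we should still see a "significant" result in about 5% of them. Here is a sketch in Python using a simple two-sided z-test on two groups drawn from the same distribution (all the numbers here are invented for illustration, not taken from any real study):

```python
import random

random.seed(1)
N, TRIALS = 30, 20_000
false_positives = 0

for _ in range(TRIALS):
    # The null hypothesis is true by construction: both groups come from the
    # same distribution, so any "effect" we detect is a false positive.
    control = [random.gauss(0, 1) for _ in range(N)]
    treated = [random.gauss(0, 1) for _ in range(N)]
    diff = sum(treated) / N - sum(control) / N
    se = (1 / N + 1 / N) ** 0.5      # SE of the difference (known SD = 1)
    if abs(diff) > 1.96 * se:        # two-sided z-test at alpha = 0.05
        false_positives += 1

print(false_positives / TRIALS)      # hovers right around 0.05
```

Each "false positive" counted here is a type I error: the simulated treatment did nothing, yet the test declared an effect.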
That positive and negative variation around the reported percentage is the margin of error, which leads us into the concept of a confidence interval. You can think of the confidence interval as a band or a range around the reported value—the “true” number lies somewhere within that band. The confidence level, by contrast, reports how confident we are that the true result lies within that band.

κ (kappa)

Remember how back in Chapter 2, we discussed how, in order to talk scientifically to each other about bears, we first had to both agree that there really is a real-world referent bear? Kappa (the Greek letter κ) is the measurement we use to see how much different observers agree on the “bear”—the subject or entity under study. Think of kappa as a percentage, written in decimal-number form. So the highest (best) value kappa can have is 100% agreement, written as 1.00. Examples of less-than-perfect (less than 100%) agreement would be 0.3 (30%), 0.5 (50%), and 0.7 (70%). Assigning value judgments (“good”, “moderate”, “poor”) to those kappa values is a matter of interpretation, and different experts differ on what the numbers mean.

Let’s look at the next three studies on agreement in reflexology first individually, then as a set, because it opens a bigger question—how do we compare values across studies?

The aim of this study was to investigate whether [reflexology] can be used as a valid method of diagnosis...Inter-rater reliability (kappa) scores were very low, providing no evidence of agreement between the examiners. CONCLUSION: Despite certain limitations to the data provided by this study, the results do not suggest that reflexology techniques are a valid method of diagnosis. (White 2000)

White is saying that the people who were using reflexology for diagnosis had very low agreement—in other words, their diagnoses were very inconsistent from one person to the next.
Presumably, White would agree with criticisms of reflexology theory on the basis that it is internally inconsistent, if it can lead to so much inter-observer variation. In other words, we can’t agree that there is a bear there.

We wanted to test the specific theory behind foot reflexology. Three reflexotherapists examined 76 patients of whom they had no previous knowledge...Interrater agreement, measured by weighted Kappa, ranged from 0.04 to 0.22, and was significantly better than chance (p < 0.05) for six parts of the body. The overall Kappa was 0.11 (95% CI: 0.08-0.14)...The statistical agreement may be better than pure chance, but is too low to be of any clinical significance. (Baerheim 1998)

(See? There’s our p and our confidence interval, as well!) Baerheim does not dispute that the agreement among the three reflexotherapists is better than pure chance would lead us to expect, but is not convinced that this finding translates into anything that would be useful in real-world practice (clinical significance). So in other words, something seems to be there, but we can’t agree on whether or not it is a bear and what it means.

AIM: The purpose of this study was to test the reliability and validity of the reflexological diagnosis method. METHODS: Eighty patients from various clinics and departments in the Hillel Yaffe Medical Center, Hadera, were examined twice by two different reflexologists. The diagnostics that resulted from these examinations were compared with the conventional medical diagnostics of the same patients. In addition, the level of correlation between the two reflexological examinations was tested. RESULTS: Out of 18 body systems, in 6 a statistically significant correlation was found between the conventional medical diagnosis and the two reflexological examinations. In 4 body systems, there was a statistically significant correlation between the conventional medical diagnosis and one out of the two reflexological examinations. The systems in which correlation was found are characterized by having a defined anatomic region. The examination of the significance of the diagnoses regarding the components of the body systems resulted in statistical significance in only 4 out of the 32 components. Between the two reflexological examinations, a statistically significant correlation was found in 14 out of the 18 body systems, and in only 15 out of the 32 system components. CONCLUSION: The reflexology method has the ability to diagnose (reliable and valid) at a systematic level only, and this is applicable only to those body systems that represent organs and regions with an exact anatomic location. (Raz 2003)

Raz found more agreement than either Baerheim or White, and drew the conclusion that in certain limited situations (systematic level, and only those systems with exact anatomical locations), reflexology diagnoses are reliably and validly consistent, not only with other reflexology diagnoses, but also with diagnoses from conventional medicine.

So can we put all these studies together into a meta-analysis? No—they are studying issues which are subtly different, but that difference is enough to make it comparing apples and oranges. What we can do is think about it at the very abstract level as “all of these studies address issues in inter-observer agreement among reflexologists, and they find varying degrees of consistency, from ‘very low’ to ‘reliable and valid at a systematic level only’”—but we can’t combine them into a meta-analysis on that basis. So what do you think explains the different results? How would you design a study to get at and resolve the underlying issues? (I don’t have an answer for you waiting in the “Answers to Exercises” section; that’s a genuinely open-ended question for you to consider and come to your own conclusions about.)

Although the reflexology studies already represented a nice range of interpretations for you to look at, there are many more examples in the literature.
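For the curious, kappa is not hard to compute. The sketch below implements Cohen’s kappa directly from its definition (observed agreement, corrected for the agreement we would expect by pure chance); the two raters and their classifications are invented for illustration:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters would pick the same category
    # if each rated at random with their own observed frequencies.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / n ** 2
    return (observed - expected) / (1 - expected)

# Invented example: two raters classifying ten shoulders as
# "capsular" (cap) or "noncapsular" (non).
a = ["cap", "cap", "non", "cap", "non", "cap", "cap", "non", "cap", "non"]
b = ["cap", "cap", "non", "non", "non", "cap", "cap", "non", "cap", "cap"]
print(round(cohens_kappa(a, b), 2))  # raw agreement of 0.8 shrinks to 0.58
```

Notice how the raw 80% agreement drops once chance agreement is subtracted out: that correction is the whole point of kappa.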
Because the Cyriax method is another modality which is understudied in comparison with Swedish massage, let’s take a quick look at kappa in studies on Cyriax next. (Don’t worry about the statistic called “rho” mentioned in one study; we are not including it in “Just Enough Statistics”.) See if you get the following interpretations out of the text below, and how you do so:

• Pellechia finds the Cyriax model highly reliable in evaluating shoulder lesions;

• Chesworth finds Cyriax’s “end-feel” technique highly reliable; in a different study, Hayes disagrees, finding it “questionable”.

James Cyriax’s approach to diagnosis and treatment of soft tissue disorders is frequently used by orthopaedic and sport physical therapists. The reliability of using Cyriax’s system to determine diagnostic categories, however, has not been established. The purpose of this study was to examine the intertherapist reliability of assessments made using Cyriax’s shoulder evaluation. Twenty-one cases of painful shoulder were evaluated independently by two experienced physical therapists. Therapists used a checklist to indicate their assessment of each case by selecting a specific shoulder lesion or by indicating that the case did not fit the Cyriax model. Cohen’s kappa statistic was used to measure intertherapist agreement. Therapists classified 19 of the 21 cases into the same diagnostic category for a percent agreement of 90.5%. The kappa value was .875, indicating “almost perfect” agreement. Both therapists classified the same four cases of painful shoulder as not fitting the Cyriax model of soft tissue examination. The results of this study show that the Cyriax evaluation can be a highly reliable schema for assessing patients with shoulder pain. (Pellechia 1996)

BACKGROUND AND PURPOSE: Findings related to joint function can be recorded with movement diagrams or by characterizing the “end-feel” according to the procedure described by Cyriax. Because both methods are used to classify pain and resistance in relation to joint range of motion (ROM), the purpose of this study was to simultaneously evaluate the reliability of these categorizations in a patient sample. SUBJECTS: Two physical therapists performed 2 assessments of passive lateral rotation of the shoulder in 34 patients. METHODS: Pain and resistance findings were recorded using movement diagrams and end-feel categories. Intraclass correlation coefficients (ICC[2,1]) were used to analyze the ratio (movement diagram) data, and kappa statistics (kappa) were used to analyze the categorical (end-feel) data. RESULTS: Intrarater ICCs varied from .58 to .89. Interrater ICCs for locating maximum pain and resistance in joint ROM varied from .85 to .91. Other interrater ICCs were lower (ICC = .34-.88). Intrarater kappa values for end-feel were moderate (kappa = .48-.59), and interrater kappa values were substantial (kappa = .62-.76). CONCLUSION AND DISCUSSION: Movement diagram measures conceptually related to the end of joint ROM and end-feel were highly reliable. This finding and the fact that additional end-feel categories were introduced in the study may partially explain the end-feel reliability findings. Consideration of their use in future studies may help to determine their clinical utility. (Chesworth 1998)

BACKGROUND AND PURPOSE. We explored the construct validity and test-retest reliability of the passive motion component of the Cyriax soft tissue diagnosis system. We compared the hypothesized and actual patterns of restriction, end-feel, and pain/resistance sequence (P/RS) of 79 subjects with osteoarthritis (OA) of the knee and examined associations among these indicators of dysfunction and related constructs of joint motion, pain intensity, and chronicity. SUBJECTS. Subjects had a mean age of 68.5 years (SD = 13.3, range = 28-95), knee stiffness for an average of 83.6 months (SD = 122.4, range = 1-612), knee pain averaging 5.6 cm (SD = 3.1, range = 0-10) on a 10-cm visual analogue scale, and at least a 10-degree limitation in passive range of motion (ROM) of the knee. METHODS. Passive ROM (goniometry, n = 79), end-feel (n = 79), and P/RS during end-feel testing (n = 62) were assessed for extension and flexion on three occasions by one of four experienced physical therapists. Test-retest reliability was estimated for the 2-month period between the last two occasions. RESULTS. Consistent with hypotheses based on Cyriax’s assertions about patients with OA, most subjects had capsular end-feels for extension; subjects with tissue approximation end-feels for flexion had more flexion ROM than did subjects with capsular end-feels, and the P/RS was significantly correlated with pain intensity (rho = .35, extension; rho = .30, flexion). Contrary to hypotheses based on Cyriax’s assertions, most subjects had noncapsular patterns, tissue approximation end-feels for flexion, and what Cyriax called pain synchronous with resistance for both motions. Pain intensity did not differ depending on end-feel. The P/RS was not correlated with chronicity (rho = .03, extension; rho = .01, flexion). Reliability, as analyzed by intraclass correlation coefficients (ICC[3,1]) and Cohen’s kappa coefficients, was acceptable (≥ .80) or nearly acceptable for ROM (ICC = .71-.86, extension; ICC = .95-.99, flexion) but not for end-feel (kappa = .17, extension; kappa = .48, flexion) and P/RS (kappa = .36, extension; kappa = .34, flexion). CONCLUSION AND DISCUSSION. The use of a quantitative definition of the capsular pattern, end-feels, and P/RS as indicators of knee OA should be reexamined. The validity of the P/RS as representing chronicity and the reliability of end-feel and the P/RS are questionable.
More study of the soft tissue diagnosis system is indicated. (Hayes 1994) κ (kappa) Remember how back in Chapter 2 (check that this is in chapter 2), we talked about how, in order to talk scientifically to each other about bears, we first had to both agree that there really is a real-world referent bear? kappa (the Greek letter κ) is the measurement we use to see how much different observers agree on the “bear”—the subject or entity under study. Think of kappa as a percentage, written in decimal-number form. So the highest (best) value kappa can have is 100% agreement, written as 1.00. Examples of less than perfect (κ less than 100 0.3 (30%) 0.5 (50%) 0.7 (70%) Assigning value judgments (“good”, “moderate”, “poor”) to those kappa values is a matter of interpretation, and different experts differ on what the numbers mean. Let’s look at the next three studies on agreement in reflexology first individually, then as a set, because it opens a bigger question—how do we compare values across studies? The aim of this study was to investigate whether [reflexology] can be used as a valid method of diagnosis...Inter-rater reliability (kappa) scores were very low, providing no evidence of agreement between the examiners. CONCLUSION: Despite certain limitations to the data provided by this study, the results do not suggest that reflexology techniques are a valid method of diagnosis. (White 2000) White is saying that the people who were using reflexology for diagnosis had very low agreement—in other words, their diagnoses were very inconsistent from one person to the next. Presumably, White would agree with criticisms on reflexology theory on the basis that it is internally inconsistent, if it can lead to so much inter-observer variation. In other words, we can’t agree that there really is a bear there. We wanted to test the specific theory behind foot reflexology. 
Three reflexotherapists examined 76 patients of whom they had no previous knowledge...Interrater agreement, measured by weighted Kappa, ranged from 0.04 to 0.22, and was significantly better than chance (p < 0.05) for six parts of the body. The overall Kappa was 0.11 (95% CI: 0.08-0.14)...The statistical agreement may be better than pure chance, but is too low to be of any clinical significance. (Baerheim 1998) (See? There’s our p and our confidence interval, as well!) Baerheim does not dispute that the agreement among the three reflexotherapists is better than pure chance would lead us to expect, but is not convinced that this finding translates into anything that would be useful in real-world practice (clinical significance). So in other words, something seems to be there, but we can’t agree on whether or not it is a bear and what it means. AIM: The purpose of this study was to test the reliability and validity of the reflexological diagnosis method. METHODS: Eighty patients from various clinics and departments in the Hillel Yaffe Medical Center, Hadera, were examined twice by two different reflexologists. The diagnostics that resulted from these examinations were compared with the conventional medical diagnostics of the same patients. In addition, the level of correlation between the two reflexological examinations was tested. RESULTS: Out of 18 body systems in 6 a statistically significant correlation was found between the conventional medical diagnosis and the two reflexological examinations. In 4 body systems, there was a statistically significant correlation between the conventional medical diagnosis and one out of the two reflexological examinations. The systems in which correlation was found are characterized by having a defined anatomic region. The examination of the significance of the diagnoses regarding the components of the body systems resulted in statistical significance in only 4 out of the 32 components. 
Between the two reflexological examinations, a statistically significant correlation was found in 14 out of the 18 body systems, and in only 15 out of the 32 system components. CONCLUSION: The reflexology method has the ability to diagnose (reliable and valid) at a systematic level only, and this is applicable only to those body systems that represent organs and regions with an exact anatomic location. (Raz 2003) Raz found more agreement than either Baerheim or White, and drew the comclusion that in certain limited situations (systematic level, and only those systems with exact anatomical locations), reflexology diagnoses are reliably and validly consistent, not only with other reflexology diagnoses, but also with diagnoses from conventional medicine. So can we put all these studies together into a single rigorous and credible meta-analysis? No—they are studying issues which are subtly different, but that difference is enough to make it comparing apples and oranges. What we can do is think about it at the very abstract level as “all of these studies address issues in inter-observer agreement among reflexologists, and they find varying degrees of consistency, from ’very low’ to ’reliable and valid at a systematic level only”’—but we can’t combine them into a meta-analysis on that basis. So what do you think explains the different results? How would you design a study to get at and resolve the underlying issues? (I don’t have an answer for you waiting in the “Answers to Exercises” section; that’s a genuinely open-ended question for you to consider and come to your own conclusions about.) Although the reflexology already represented a nice range of interpretations for you to look at, there are many more examples in the literature. Because the Cyriax method is another modality which is understudied in comparison with Swedish massage, let’s take a quick look at kappa in studies on Cyriax next. 
(Don’t worry about the statistic called “rho” mentioned in one study; we are not including it in “Just Enough Statistics”.) See if you get the following interpretations out of the text below, and how you do so:

• Pellechia finds the Cyriax model highly reliable in evaluating shoulder lesions;
• Chesworth finds Cyriax’s “end-feel” technique highly reliable; in a different study, Hayes disagrees, finding it “questionable”.

James Cyriax’s approach to diagnosis and treatment of soft tissue disorders is frequently used by orthopaedic and sport physical therapists. The reliability of using Cyriax’s system to determine diagnostic categories, however, has not been established. The purpose of this study was to examine the intertherapist reliability of assessments made using Cyriax’s shoulder evaluation. Twenty-one cases of painful shoulder were evaluated independently by two experienced physical therapists. Therapists used a checklist to indicate their assessment of each case by selecting a specific shoulder lesion or by indicating that the case did not fit the Cyriax model. Cohen’s kappa statistic was used to measure intertherapist agreement. Therapists classified 19 of the 21 cases into the same diagnostic category for a percent agreement of 90.5%. The kappa value was .875, indicating “almost perfect” agreement. Both therapists classified the same four cases of painful shoulder as not fitting the Cyriax model of soft tissue examination. The results of this study show that the Cyriax evaluation can be a highly reliable schema for assessing patients with shoulder pain. (Pellechia 1996)

BACKGROUND AND PURPOSE: Findings related to joint function can be recorded with movement diagrams or by characterizing the “end-feel” according to the procedure described by Cyriax.
Because both methods are used to classify pain and resistance in relation to joint range of motion (ROM), the purpose of this study was to simultaneously evaluate the reliability of these categorizations in a patient sample. SUBJECTS: Two physical therapists performed 2 assessments of passive lateral rotation of the shoulder in 34 patients. METHODS: Pain and resistance findings were recorded using movement diagrams and end-feel categories. Intraclass correlation coefficients (ICC[2,1]) were used to analyze the ratio (movement diagram) data, and kappa statistics (kappa) were used to analyze the categorical (end-feel) data. RESULTS: Intrarater ICCs varied from .58 to .89. Interrater ICCs for locating maximum pain and resistance in joint ROM varied from .85 to .91. Other interrater ICCs were lower (ICC = .34-.88). Intrarater kappa values for end-feel were moderate (kappa = .48-.59), and interrater kappa values were substantial (kappa = .62-.76). CONCLUSION AND DISCUSSION: Movement diagram measures conceptually related to the end of joint ROM and end-feel were highly reliable. This finding and the fact that additional end-feel categories were introduced in the study may partially explain the end-feel reliability findings. Consideration of their use in future studies may help to determine their clinical utility. (Chesworth 1998) BACKGROUND AND PURPOSE. We explored the construct validity and test-retest reliability of the passive motion component of the Cyriax soft tissue diagnosis system. We compared the hypothesized and actual patterns of restriction, end-feel, and pain/resistance sequence (P/RS) of 79 subjects with osteoarthritis (OA) of the knee and examined associations among these indicators of dysfunction and related constructs of joint motion, pain intensity, and chronicity. SUBJECTS. 
Subjects had a mean age of 68.5 years (SD = 13.3, range = 28-95), knee stiffness for an average of 83.6 months (SD = 122.4, range = 1-612), knee pain averaging 5.6 cm (SD = 3.1, range = 0-10) on a 10-cm visual analogue scale, and at least a 10-degree limitation in passive range of motion (ROM) of the knee. METHODS. Passive ROM (goniometry, n = 79), end-feel (n = 79), and P/RS during end-feel testing (n = 62) were assessed for extension and flexion on three occasions by one of four experienced physical therapists. Test-retest reliability was estimated for the 2-month period between the last two occasions. RESULTS. Consistent with hypotheses based on Cyriax’s assertions about patients with OA, most subjects had capsular end-feels for extension; subjects with tissue approximation end-feels for flexion had more flexion ROM than did subjects with capsular end-feels, and the P/RS was significantly correlated with pain intensity (rho = .35, extension; rho = .30, flexion). Contrary to hypotheses based on Cyriax’s assertions, most subjects had noncapsular patterns, tissue approximation end-feels for flexion, and what Cyriax called pain synchronous with resistance for both motions. Pain intensity did not differ depending on end-feel. The P/RS was not correlated with chronicity (rho = .03, extension; rho = .01, flexion). Reliability, as analyzed by intraclass correlation coefficients (ICC[3,1]) and Cohen’s kappa coefficients, was acceptable (≥ .80) or nearly acceptable for ROM (ICC = .71-.86, extension; ICC = .95-.99, flexion) but not for end-feel (kappa = .17, extension; kappa = .48, flexion) and P/RS (kappa = .36, extension; kappa = .34, flexion). CONCLUSION AND DISCUSSION. The use of a quantitative definition of the capsular pattern, end-feels, and P/RS as indicators of knee OA should be reexamined. The validity of the P/RS as representing chronicity and the reliability of end-feel and the P/RS are questionable.
More study of the soft tissue diagnosis system is indicated. (Hayes 1994)

9.10 Break

This chapter and the previous one were really dense in terms of the material we covered. But the hardest part of learning about reading research is now over, and if you have stuck with it this far, I promise you that you will find the rest of the book to be smooth sailing in comparison, building readily on what you have already learned. Since you’ve worked so hard on the methods and statistics parts, here’s another nice bear picture for you to look at, while we take a well-earned break.

Figure 9.20: (bear photograph)

9.11 Exercise 1:

9.12 Exercise 2:

9.13 Next steps

Now that we know what the methods and the most important (for our purposes) statistics are, let’s move on to look at how study data is reported in the “Results” section.

=====================================================================================================

Trish Greenhalgh, in her excellent book How to Read a Paper (2001), puts her own twist on the meaning of evidence-based medicine, defining it as: “…the enhancement of a clinician’s traditional skills in diagnosis, treatment, prevention, and related areas through the systematic framing of relevant and answerable questions and the use of mathematical estimates of probability and risk” (p.1). This adds two new concepts to our definition of evidence-based practice. First, practitioners need to learn how to ask clinical questions that are answerable, something that is not as straightforward as it initially sounds. Secondly, we need to learn to understand the components of research that scare many of us the most – the statistics. Statistics are part of the researcher’s effort to demonstrate the extent to which the data being presented is valid. There are many simple ways of evaluating statistics without becoming a statistician.
Dryden & Achilles

Objectives for this chapter:

Do

• name and explain the most common and most important statistical measures used in articles on massage research

Know

• Statistics: mean, median, mode, standard deviation, power
• Average
• Mean
• Median
• Mode
• Percentile
• Standard deviation
• False positive error (type I error)
• False negative error (type II error)
• α (alpha)
• β (beta)
• p
• Confidence interval
• Sampling
• Power and sample size
• κ (kappa)

Appreciate

#### Descriptive statistics

##### Challenge

Our data in studies typically comes from measurements on individuals. How can we use this individual data to make meaningful statements about populations, so that we can generalize the knowledge we gain from research studies?

As discussed in Chapter 2, "normal" is a word that has a specific, neutral meaning in science, yet can have strong connotations in everyday language. In scientific use, "normal" means "typical, usual, or according to the rule or standard". Generally, very few people in the total population are extreme in some physical measurement; most are pretty close to a typical value in respect to most measurable physical qualities.

For example, consider the birth weight of all babies born in developed countries. In this group, there will be a few big babies, weighing 8½ to 9 pounds or more. There will also be a few small babies, who weigh 6 to 6½ pounds or less. Unless some sort of problem occurs, such as gestational diabetes or premature birth, most babies weigh about 6½ to 8 pounds at birth. This weight is called a "normal" birth weight and takes its name from where it is found on the graph of a normal distribution in a defined population or group.
In this example, the population being considered is girl babies born in Europe. Source: http://basicmathsuccess.files.wordpress.com/2012/02/birth-weights-bell-curve-1.jpg?w=640&h=474 accessed 1 May 2012

This image shows a graph representing birth weight as a normal distribution, also called a "bell curve". In this graph, the vertical axis describes the number of babies, and the horizontal axis describes the birth weight. The relatively few very small and very large babies are the small quantities shown at the extreme left and right sides of the graph (forming the small "tail" at either end). The larger number of 6½-to-8-pound babies makes up the big "bump" or curve at the center of the graph. The group making up the largest part of the distribution represents the normal values. In this sense, normal means "most commonly found."

Because data values for many natural phenomena, when not subjected to some purposeful manipulation (such as a massage treatment), tend to form this normal distribution, with most of the numbers in the middle and a few extreme values at either end, the untreated distribution can serve as a baseline: we can compare how the data is distributed after such a treatment against how it was distributed before, to see whether the treatment changed it significantly. Recognition of this possibility lies at the heart of some of the most useful and powerful concepts in mathematics and science.

While all bell curves have the same basic features, there can be some important differences in the details of how the data the normal distribution represents is arranged. Figure 4-4 shows two bell curves that illustrate different normal distributions, in which the data is spread out (dispersed) in different ways. In a steep curve, the data is clustered closely together; in a gently sloped curve, the data is spread more widely.
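The steep-versus-gentle distinction is easy to simulate. The sketch below is illustrative only (the weights are invented, not taken from the birth-weight figure): it draws two samples with the same mean but different spreads, and reports how much of each sample lies within one standard deviation of its mean, which for a normal distribution should be close to 68%.

```python
import random
import statistics

random.seed(42)  # fixed seed so the illustration is reproducible

# Two hypothetical birth-weight populations with the same mean (7.25 lb)
# but different spreads: one tightly clustered, one widely dispersed.
steep = [random.gauss(7.25, 0.3) for _ in range(10_000)]
flat = [random.gauss(7.25, 0.9) for _ in range(10_000)]

for name, data in [("steep", steep), ("flat", flat)]:
    mean = statistics.mean(data)
    sd = statistics.stdev(data)
    # Fraction of the sample lying within one SD of the mean.
    within_1sd = sum(mean - sd <= x <= mean + sd for x in data) / len(data)
    print(f"{name}: mean={mean:.2f} lb, SD={sd:.2f} lb, "
          f"{within_1sd:.0%} within 1 SD")
```

Both samples report a mean near 7.25 and roughly 68% of values within one SD; only the size of the SD differs, which is exactly the difference between the two curves in Figure 4-4.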
#### Median

Since extreme values can cause the mean to give a misleading picture of the data it is intended to describe, it is often useful to apply another approach to the concept of averaging. The median can provide insight into the distribution of the data that the mean often cannot.

Median literally means "in the middle." The strip that runs directly down the middle of a highway, like in this image from Dublin, Ireland, is called a median. This image from Gray's Anatomy shows the median antebrachial cutaneous nerve running down the middle of the upper arm in cutaway.

In statistics, the median represents an average of the data that has been calculated in a different way from the mean. Imagine first sorting the data values into a list that runs from high to low. Then imagine drawing a line at the exact midpoint of that list. The line represents the median: the value above which half the data falls, and below which the other half of the data falls. In this way, the median is representative of the data in a different way from the mean.

The median for a set of test scores is shown in Table 4-1. In this example, the median score is 65 (Susan's score). Four students have scores higher than 65 and four others have scores lower than 65, so Susan's score represents the median. In Figure 4-3, the median is shown by the line that cuts the bell curve exactly in half, showing that half the babies had birth weights to the left of the center line (lower than the median) and half to the right (higher than the median).

Let's find the median score for the before-massage group in Barlow's sit-and-reach study. First, we'll rearrange our table in descending order, so that the sit-and-reach scores go from highest to lowest. Note that sometimes in the real world things are not cut-and-dried: with this data, there are 4 scores above the median score and 5 scores below it.
The scores for subject 10 and for subject 6 were as close to a middle score as we could get. Therefore the median value is only roughly in the middle.

Figure 9.2: Sit-and-reach scores in centimeters (Barlow 2004).

Note that here there are exactly 4 scores above and 4 scores below the score of 17, which is the score for subjects 7, 6, and 2. So in this case the median value is exactly in the middle.

Here is another example of a researcher reporting median values. Note: just focus on the black text; we’ll get to percentiles in a little bit.

* Data are reported as median (25th percentile, 75th percentile). Fifty-three patients (30 controls and 23 minute spheres) completed the study. ** Morphine requirements (47 mg [27, 58] vs ***41 mg [25, 69]) and pain scores (29.5 mm [16, 59] vs 40 mm [22, 58]) were similar in the control and acupressure groups. (Sakurai 2003)
##### Mode

In a set of data, the mode is the most frequently occurring value. This is a less commonly used way of representing the average of a group of data values. For example, if a group of scores representing reduction in pain levels among eight people who received massage are 1, 2, 4, 5, 5, 5, 6, and 7, the mode is 5, because it occurs more frequently than any of the other scores. Not all sets of data have a mode; if the pain scores in that population were 1, 2, 3, 4, 5, 6, 7, and 8, all the values occur exactly once. Since no value occurs more often than the other values, there is no mode in that data set. A data set can also have more than one mode. If the scores had been 1, 2, 2, 2, 4, 7, 7, and 7, the data would have had two modes, 2 and 7, because they occurred with equal frequency in that group.
In other words, the mode is the value "most typical" of the population. A data set can have no mode (if all the values are unique—although some authors consider this to mean that every value is a mode), one mode (if one particular value is the most frequently-occurring value), or more than one mode (if multiple values are tied for the most-frequently occurring value).

Figure 9.7: Two example data sets: one with two modes (14 and 13), and one with a single mode (17).

You won't come across mode too often, as its meaningfulness in research applies more narrowly than mean and percentile. Here is an example of a researcher reporting modes:

##### Treatment Approach

The mode responses for the best approach to treatment for all conditions were 3 and 4, indicating an "Equal Mix" or "Mostly CAM" (Figure 4).

The statistics we have covered up until now are useful, but in order to get a clearer picture of what all the data looks like, there are more refined tools we can use to understand the relationships among the values.

#### Percentiles

A researcher may report data by citing the percentile of a particular value.
The percentile represents a value on a scale of 100 that indicates the percent of a distribution that is equal to or below that value. For example, if the birth weight of a baby is reported to be in the 99th percentile, it means that 99% of the other birth weight values are lower than that value. An example of how median and percentiles of data are reported is shown in Table 4-2. This table summarizes the results of a study (Sakurai et al.) in which researchers were looking for an alternative to morphine for postoperative pain relief in patients who had abdominal surgery. They were evaluating whether the use of minute sphere acupressure had any effect on postoperative pain levels and morphine requirements. As shown in this table, 53 patients completed the study (30 in the control group and 23 in the treatment group). Data shown in columns for both groups reflect the median, 25th percentile, and 75th percentile values. (Remember that the median represents the value for which half the values are higher and half are lower, so another way of referring to the median is as the 50th percentile.) As you can see, the postoperative morphine requirements and pain scores were similar in both the control and treatment groups, so the study concluded that the acupressure treatment had no significant effect on either measurement. These findings were stated in the research article as: Morphine requirements (47 mg [27, 58] vs 41 mg [25, 69]) and pain scores (29.5 mm [16, 59] vs 40 mm [22, 58]) were similar in the control and acupressure groups.2 As shown in this excerpt, it is common for researchers to report key measurements in the shorthand format of listing the median value, followed by the 25th percentile and 75th percentile values in brackets, for the control group and any treatment groups. If a value is in the 99th percentile, that means that 99% of the values are lower. 
For a value in the 60th percentile, 60% of the values are lower; for a value in the 30th percentile, 30% of the values are lower, and so forth.

Now let's revisit the excerpt from Sakurai 2003 on the previous page. Note that the previously white print is now bolded black, because that is the part we are going to discuss now.

Figures 9.3 and 9.4: the Sakurai excerpt, with the relevant print highlighted.

*Data are reported as median (25th percentile, 75th percentile). Fifty-three patients (30 controls and 23 minute spheres) completed the study. **Morphine requirements (47 mg [27, 58] vs ***41 mg [25, 69]) and pain scores (29.5 mm [16, 59] vs 40 mm [22, 58]) were similar in the control and acupressure groups. (Sakurai 2003)

Interpreting what this says is not as difficult as it first looks. Below is how the shorthand gets translated; as you can see, it is really not that difficult.

* "Data are reported as median (25th percentile, 75th percentile)."

Figures 9.5 and 9.6: translation of the shorthand.

** "Morphine requirements (47 mg [27, 58] vs *** 41 mg [25, 69])"
And here all of the information can be put into a chart to make the results clearer.

Table 9-7:

| Group | No. of patients | Morphine: 25th percentile | Morphine: median | Morphine: 75th percentile | Pain: 25th percentile | Pain: median | Pain: 75th percentile |
|---|---|---|---|---|---|---|---|
| Control | 30 | 27 mg | 47 mg | 58 mg | 16 mm | 29.5 mm | 59 mm |
| Treatment (minute sphere acupressure) | 23 | 25 mg | 41 mg | 69 mg | 22 mm | 40 mm | 58 mm |
| Total | 53 | | | | | | |
#### Boxplots

A boxplot (also known as a box-and-whiskers diagram) is a method of graphically summarizing groups of numerical data. The five statistical measures it depicts are: the median, the upper and lower quartiles, and the minimum and maximum data values. The boxplot can also show any outliers (values that are significantly outside of the main data grouping). Figure 4-6 is a boxplot that represents the results of Sakurai's study in regard to patients' morphine consumption. Each box represents the 25th percentile through the 75th percentile values, and the "whiskers," or extended lines on either end of the box, represent the highest and lowest observed values. The line through the middle of the box represents the median (50th percentile).
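The five numbers a boxplot depicts can be computed in a few lines. This sketch uses hypothetical pain scores, not Sakurai's raw data; `statistics.quantiles` requires Python 3.8 or later, and its default "exclusive" method is one of several common quartile conventions.

```python
import statistics

# Hypothetical post-operative pain scores (mm on a visual analogue scale).
scores = [16, 22, 25, 29, 31, 34, 40, 47, 55, 59]

# statistics.quantiles with n=4 returns the three quartile cut points:
# 25th percentile, median (50th), and 75th percentile.
q1, median, q3 = statistics.quantiles(scores, n=4)
print(f"min={min(scores)}, Q1={q1}, median={median}, "
      f"Q3={q3}, max={max(scores)}")
# → min=16, Q1=24.25, median=32.5, Q3=49.0, max=59
```

Those five numbers are exactly what the box (Q1 to Q3, with the median line) and the whiskers (min and max) would display.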
Note that the overall positioning of the treatment group box on the graph is similar to that of the control group box; the treatment group median value is lower, but not low enough for the difference in morphine consumption between the groups to have been statistically significant. Figure 4-7 shows how a boxplot, if tipped onto its side, represents much the same distribution of data as that shown in a bell curve. The values inside the box are roughly equivalent to those contained in the “normal” section of a bell curve. The values represented by the “whiskers” extending from the end of each box (the highest and lowest observed values of the total data range) are roughly equivalent to the tails shown in the bell curve. #### Data Range The range of data about a population includes all the values, from the lowest to the highest. The range of normal values refers to the values from lowest to highest that are considered to be normal. In the following example, researchers used a range of normal values for heart rate and systolic blood pressure to evaluate complementary pain therapies for safety in a population of heart surgery patients (Kshettry et al.). Complementary therapies (touch, music) are used as successful adjuncts in treatment of pain in chronic conditions. Little is known about their effectiveness in care of heart surgery patients. Our objective is to evaluate feasibility, safety, and impact of a complementary alternative medical therapies package for heart surgery patients…. Decreases in heart rate and systolic blood pressure in the complementary therapies group were judged within the range of normal values…. 
Complementary medical therapy was not associated with safety concerns and appeared to reduce pain and tension during early recovery from open heart surgery.3

While massage is known to lower heart rate and blood pressure in promoting relaxation, the question Kshettry's team was interested in was whether such reductions were safe for this group of patients, or whether they would cause those values to drop too low. Even though the therapies studied caused the heart rate and the systolic blood pressure to decrease, those measures never fell below the lowest normal value; that is, they remained within a normal range. For that reason, the team concluded that there were no safety concerns regarding the use of complementary therapies among this population of heart surgery patients.

#### Range of means

Notice that two concepts learned separately (mean and range) can be combined to form a more complex measure, the range of means. For any one set of data values, there is only one arithmetic mean. When a range of means is shown, there must be multiple sets of values corresponding to multiple aspects being tracked or measured. In the research literature, the upper and lower bounds of the range are often included in parentheses next to the mean, as shown in the following example:

Facial Grading Scale change scores showed that experimental group (27.5 (20-43.77)) improved significantly more than the control group (16.5 (12.2-24.7)).4

In this study, Manikandan examined the effects of a treatment called facial neuromuscular re-education, comparing it to conventional therapeutic measures to determine whether one or the other was more effective in treating Bell's palsy, a type of paralysis that affects one side of the face, involving the facial nerve on that side. In this case, effectiveness referred to improvements in facial symmetry by relieving the paralysis on one side of the face.
The research team used Facial Grading Scores to measure patients' improvement and found that the experimental group experienced a mean improvement on all items on the scale of 27.5, with that mean representing a data set whose values range from 20 at the lowest to 43.77 at the highest. The control group experienced a mean improvement of 16.5, with that mean representing a data set whose values range from 12.2 at the lowest to 24.7 at the highest. Therefore, the study concluded that individualized facial neuromuscular re-education is more effective in improving facial symmetry in patients with Bell's palsy than conventional therapeutic measures.

#### Variance and Standard Deviation

Variance is a statistical measure that describes how spread out, or dispersed, the data values are from the mean. Since the mean is a type of average of all the data as a whole, the more data that is like the mean, the more representative the mean is of the entire data set.

Example 1 – The mean of the values 2, 3, 3, 3, and 4 is 3 (the total of all the numbers is 15, and 15 divided by 5 = 3).

Example 2 – The mean of the values 1, 2, 3, 4, and 5 is 3 (the total of all the numbers is 15, and 15 divided by 5 = 3).

The mean in Example 1 (3) is more representative of the overall data than is the mean in Example 2 (also 3). This is because the values in the second data set are farther from the mean than are those in the first set, which has no ones or fives. The variance measure is used less frequently than the standard deviation (SD), which is represented by the mathematical symbol σ (pronounced SIG-ma). The SD is calculated using a fairly complex formula, and as a reader of research literature, you don't need to know how to do the calculation. It is important to understand that the principle behind the SD is exactly the same as variance – how far away from the mean the actual data for a population is distributed.
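If you are curious, though, the two example data sets above can be checked in a couple of lines with Python's statistics module, which handles the formula for you:

```python
import statistics

example1 = [2, 3, 3, 3, 4]  # clustered near the mean
example2 = [1, 2, 3, 4, 5]  # more spread out

for data in (example1, example2):
    # pvariance/pstdev treat the list as a whole population;
    # use variance/stdev for a sample drawn from a larger population.
    print(statistics.mean(data),
          statistics.pvariance(data),
          round(statistics.pstdev(data), 2))
```

Both sets have a mean of 3, but Example 1 has a population variance of 0.4 (SD about 0.63) while Example 2 has a variance of 2 (SD about 1.41): the same mean, with very different dispersion.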
The smaller the standard deviation, the closer the data points are to the mean. In this case, the data would be described as having a minimal amount of dispersion. Conversely, a larger SD means that more data points are farther away from the mean – more widely dispersed – and because of those extreme values, the mean is not as representative of that set of data. The SD can be "plus or minus," meaning that the data can be dispersed on either side of the mean, higher or lower.

In a population with a normal distribution (bell curve), the data is dispersed around the mean, as shown in Figure 4-8. In a normal distribution, 68% of the values for that population are within 1 positive SD or 1 negative SD of the mean for that population. Similarly, 95% of the population is within 2 SD in either direction of the mean, and 99.7% is within 3 SD in either direction of the mean.

Data is reported in the format of mean ± (plus or minus) SD. For example, a measurement shown in inches as 4 ± 1.5 indicates a mean of 4 inches, with a standard deviation of 1.5 inches.

How the mean and the SD are used in the reporting of research results can be seen in the following example from a project that studied the effects of coconut and mineral oil in infant massage (Sankaranarayanan).5 The infants massaged with coconut oil showed an average weight in grams at 31 days of 2396.77 ± 208.94. This tells the reader:

The average weight of the infants in that group is 2396.77 grams. Of those infants' weights, 68% fall in a range from 2187.83 grams (2396.77 grams – 208.94 grams, or 1 SD less than the mean) to 2605.71 grams (2396.77 grams + 208.94 grams, or 1 SD more than the mean).
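As a quick arithmetic check of that reading, here is the same calculation in Python (the mean and SD are the numbers quoted above; the 95% range is added on the same principle):

```python
# Mean and SD as reported for the coconut-oil group (Sankaranarayanan),
# quoted in the text above: weight in grams at day 31.
mean = 2396.77
sd = 208.94

# In a normal distribution, about 68% of values lie within 1 SD of the
# mean, about 95% within 2 SD, and about 99.7% within 3 SD.
one_sd_low, one_sd_high = round(mean - sd, 2), round(mean + sd, 2)
two_sd_low, two_sd_high = round(mean - 2 * sd, 2), round(mean + 2 * sd, 2)

print(one_sd_low, one_sd_high)  # 2187.83 2605.71 -- the 68% range given in the text
print(two_sd_low, two_sd_high)  # 1978.89 2814.65 -- the 95% range
```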
#### Power and sample size

One limitation often found in massage research methods relates to study size—you'll find statements in the literature like, "Most studies contain methodological limitations including … few subjects …",1 or "These conclusions are limited by the small sample size of the included [research studies]."2 Clearly, when it comes to results, something methodologically important is going on with small studies. Additionally, you may have heard people say a massage research study needs about 35 or 40 people, more or less, to have a large enough sample size—what's up with that? What's so special about that number?

Like the indicator of statistical significance p discussed in the last issue, the power of a test is a probability. In this case, it is the probability that the test will not make a Type II error (false negative) by missing a treatment effect that is really there. When p < 0.05, for example, it represents a less than 5% chance, or 1 time out of every 20 that you rerun the study, that you would make a Type I error (false positive), that is, think that you were observing a real effect when it was really due to chance. While there is no universal measure of power, you'll often see 0.80 as a target that researchers aim for—it means that they expect that 80% of the time, or 4 times out of 5, if there is a treatment effect in the study, they will detect it. (Remember, for both p and power, when it is represented as a decimal number, multiply that number by 100 to get the percentage it represents.)

The risks of false negative and false positive errors can never be totally eliminated, but judicious use of statistical significance and of power allows both of those risks to be managed, resulting in a certain degree of confidence in the validity of the study results. The ideas of statistical power, sample size, and the null hypothesis are tightly linked to each other, and to considerations presented in the Methods section.
For reasons we'll get deeper into in a later discussion, researchers look at the evidence to see whether it calls for rejecting the null hypothesis and supporting their own hypothesis. For example, if a researcher hypothesizes, as Jönhagen's ("Sports Massage After Eccentric Exercise") team did, that "Sports massage can improve the recovery after eccentric exercise,"3 then the null hypothesis would be something like "Sports massage has no effect on recovery after eccentric exercise." All of these concepts come back, ultimately, to whether to accept or reject the null hypothesis.

As it happens, Jönhagen's team ended up accepting the null hypothesis and rejecting their research hypothesis, because they found that the massage had no effect on their measurements of quadriceps pain, strength, or function after the exercise. We'll get back to the larger implications of those findings toward the end of this chapter, but here, we'll just talk about the null hypothesis. A goal of a research study is to correctly determine whether to accept or reject the null hypothesis—neither to accept it mistakenly (false negative) nor to reject it mistakenly (false positive).

To see how that works in practice, we'll switch from sports massage to cardiac surgery for a moment, since a particular research article demonstrates clearly how the researchers calculated a power analysis for their study. Hattan's ("The Impact of Foot Massage and Guided Relaxation Following Cardiac Surgery: A Randomized Controlled Trial") research team investigated whether foot massage and guided relaxation promoted calmness (among other measures) in cardiac surgery patients.
Their description of how they determined the ideal sample size for their study points at the multiple factors involved: "A post hoc [carried out after the study] power analysis test suggested that a sample size of 45 would be required to detect a difference of the size observed with an acceptable level of Type II error [false negative] (power = 0.8)."4 From this statement, we can see that statistical power has to do with detecting an effect, with the size of a sample, and with how much risk of error we're willing to tolerate. In the literature, you'll often see it written in a much shorter way, but Hattan's description shows details of what is involved in a power analysis—sample size, effect size, and acceptable tolerance of error.

One way to think of it is, how large a study population do you need to make sure you see an effect that is there—that you don't make a false negative error by missing something? If it's a large effect, you probably don't need as many people to see it as you do if it's a small effect—in other words, if it's something that could be easily missed, you improve your chances of seeing it by looking for it in more people. But if it's a major effect, it will probably show up more dramatically, and you can see it in fewer people. For that reason, increasing sample size is a very common way of increasing the power of a test.

So where did that often-mentioned number 35–40 for massage studies referred to earlier come from? It's an estimate that probably came out of one particular study as having sufficient power in that context, and was then accidentally generalized into a more universal number that is sometimes quoted as applying to many massage research studies. But since a sufficiently large sample size depends on the size of the effect being looked for, and how much risk of false negative error the researchers are willing to accept, it really depends on the question being researched.
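To make this concrete, here is a minimal Monte Carlo sketch of power in pure Python. This is not how Hattan's team ran their post hoc analysis; the normally distributed outcome, the simplified known-SD cutoff (a z-test rather than the t-test a real study would likely use), and every number in it are assumptions for illustration only:

```python
import random
import statistics

def simulated_power(n_per_group, effect_size, n_sims=2000, seed=1):
    """Estimate statistical power by simulating many hypothetical studies.

    Each simulated study draws a control group from a normal distribution
    (mean 0, SD 1) and a treatment group whose mean is shifted up by
    `effect_size` SDs. The "study" declares an effect detected when the
    difference between group means exceeds the usual two-sided 5% cutoff
    of 1.96 standard errors (assuming the SD of 1 is known).
    Power = the fraction of simulated studies that detect the real effect.
    """
    random.seed(seed)
    se = (2 / n_per_group) ** 0.5  # standard error of the difference in means
    detections = 0
    for _ in range(n_sims):
        control = [random.gauss(0.0, 1.0) for _ in range(n_per_group)]
        treatment = [random.gauss(effect_size, 1.0) for _ in range(n_per_group)]
        diff = statistics.mean(treatment) - statistics.mean(control)
        if abs(diff) > 1.96 * se:
            detections += 1
    return detections / n_sims

# A large effect is easier to detect than a small one at the same sample
# size, and a small effect needs more subjects to reach the same power.
print(simulated_power(20, effect_size=0.8))   # roughly 0.7
print(simulated_power(20, effect_size=0.2))   # much lower
print(simulated_power(200, effect_size=0.2))  # more subjects buy power back
```

Running it with different values lets you watch the trade-off the text describes: shrinking the effect size lowers power, and raising the sample size restores it.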
When researchers design a study, they put a lot of time and effort into the question of how many participants to include, and they consult statisticians to determine that number, because they know that funding agencies and peer reviewers will (or, at least, should) examine it carefully to determine whether they've gotten it right for their purposes. There's no "one size fits all" number that massage research studies should have to ensure sufficient power. Instead of trying to come up with such a number for all studies, a better strategy is to follow the researchers' logic, as explained in the article, for why that particular number was right—ensured sufficient power—for that study on its own terms. If the researchers' explanation of how the sample size was chosen makes sense, it's probably worth trusting for purposes of evaluating that article. If it doesn't make sense, or if it is not explained at all, it may indicate a problem for interpreting the study's results.

##### Average

The average is an attempt to describe qualities of a group by combining qualities of individual members of the group. The mean, median, and mode describe different ways of averaging, which tell something about the distribution of those individual qualities or values.

###### Mean

Although you may not have heard it referred to by that name, you're already familiar with the concept of mean: it is the kind of average commonly seen in school grading. To get the mean, you add all the results together, and then divide by the number of results.
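For example, a minimal sketch in Python (the five final-exam grades are hypothetical, invented for illustration):

```python
grades = [85, 90, 75, 65, 70]  # five hypothetical final exam grades

# Add all the results together, then divide by the number of results.
mean = sum(grades) / len(grades)  # 385 / 5
print(mean)  # 77.0
```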
###### Example from the literature

Barlow 2004 investigated whether a single massage would alter the flexibility of the hamstring in physically-active young men, as measured by the value on the sit-and-reach test. He included his data in Table 1, so we can calculate the mean of all the sit-and-reach scores for the subjects (1) before and (2) after the massage by adding all the values in the appropriate column, and then dividing by 11 (the number of subjects in the study).

The disadvantage of the mean is that it can't tell you about extreme values in the data, or how any individual compares to the group, except in the most crudely approximate way. In order to examine this limitation further, let's set up our own table including the mean score (shown in the callouts in Table 1 on the previous page). In the last column, observe the difference in score for each subject from the mean score.

###### Mode

The statistics we have covered up until now are useful, but in order to get a clearer picture of what all the data looks like, there are more refined tools we can use to understand the relationships among the values. Standard deviation (SD) is one of those tools.

#### Preparing to discuss standard deviation

This (standard deviation) is probably the hardest concept we are going to cover. But it is worth it, because of the value of the concept and its applicability to so many different situations. So let's break this up into small pieces to tackle one piece at a time, and see how we can use it, not only in reading massage research, but in many other situations as well.

I remember when a sad event in my childhood brought home to me the concept of a population, although I certainly didn't think about it that way at the time. When I was in fifth grade, a child at my school died. Although I didn't know the child personally, I was sad to hear the news, as was everyone else there.
Then I started putting it together with what had happened the year before, when another child had died. I figured out that there must be some kind of rule that every year one child dies at our school, and that next year it could be me. That particular thought was scary enough to keep me awake for a couple of nights.

Although I was kind of on the right track in certain ways, there were some flaws in my analysis; however, as I was 10 years old at the time, I think I can be forgiven for a certain lack of mathematical rigor. The observation that there was a pattern—the death of one child per year—was a reasonable observation for that very short time span, although if I had been paying attention longer, it is possible that there would have been many other years where no child at that school died.

But from that observation of a pattern, I went a little too far in imagining a "rule" that one child died every year—it would be better to think of it as a description of what did happen, rather than as a prescription for what must happen. If you think of it in that way, you can see one function that statistics serves—descriptive statistics summarizes the data about a population or a study, and describes in what way the members are similar (central tendency) or different (variability). It takes a very diverse group, and tries to convey concisely and efficiently to the audience what the important measures of that group are. The statistical measures we have gone over up until now—mean, median, mode, and percentile—are descriptive statistics.

Inferential statistics takes things a step farther—it lets us use reasoning to infer, or make predictions, about the group, based on what we already know. It's what I was dimly sensing when I realized that another child could die at my school the next year (I was, and still am, quite happy to have been proved wrong on that prediction), and so came up with my "rule".
The statistics we are going to talk about now are inferential statistics, and understanding the concepts of normal distribution, standard deviation, types of error, sample size and power, and inter-observer agreement will make a great deal—even most—of the massage research literature accessible to you.

Finally, one more thing about my example, and then we'll let it go—remember in Chapter 3 when we talked about how science is about what's common to everyone, while spirituality can be about what is unique and special? I've gotten the sense from some of my students, and have felt it myself, that there is something vaguely disturbing about talking about such sad events as a child's death in terms of a population event, and I suspect that some of the aversion I've heard people express to science has something to do with the sense that science somehow sucks out what is special about being human. I would respectfully suggest that the two are not mutually exclusive—it is possible to operate in the two different modes at different times, as appropriate, and in that way to get the best of both—the rigor AND the compassion, as we talked about earlier.

#### Standard deviation

Standard deviation has a lot in common with the averages we discussed earlier, and we will talk about how we can use it as a kind of descriptive statistic. To understand standard deviation, however, we first have to all be on the same page about what normal distribution means, so we're going to talk about that first, and then come back to standard deviation.

##### Normal distribution

We talked earlier in Chapter 2 about how "normal" is one of those words that has a specific, neutral meaning in science, yet has very strong connotations in everyday language. It's unfortunate that this word is so heavily loaded, as it is one of the most useful and powerful statistical concepts there is, and serves as a gateway to the world of inferential statistics.
The word has been used as a weapon to enforce social and medical agendas—after I have taught a session on massage research and fibromyalgia, I've had people come up to me afterwards and tell me how painful it is to be told they are not "normal", where "normal" is a prescriptive word for how they should be. Let's be very clear that this is not how we're using the word. Our specific statistical use of the word is defined below.

First of all, think about a situation you've been in with a lot of other people—a lot of the time, a few people are extreme in some value one way or the other, but most people are pretty close to average. We've all been born, so let's consider the weight at birth of all healthy babies born in the US as our example situation:

• A few very big babies: 8½ to 9 pounds

• A few very small babies: 6 to 6½ pounds

• Most babies somewhere around 7 or 8 pounds, more or less: called normal birthweight, because it forms a normal distribution

This is what that normal distribution looks like. The curved line is called a bell curve—a pretty descriptive name, because it is indeed shaped like a bell.

Figure 9-2: Bell curve showing normal distribution of birthweights

While all bell curves have the same basic features of a small "tail" at either end (representing a few extreme values) and a large "bump" in the middle (representing a lot of typical values), there can still be some dramatic differences in how the data the bell curve represents is arranged. The following are both bell curves, but look how different they are from each other:

Figure 9-3: Two different bell curves

The graph on the left is tall and narrow and drops off sharply. The graph on the right is shorter and drops off much more gently. These differences are useful, because they tell something about the data being studied—namely, about how different the extreme values are from the more typical values for that population.
The standard deviation, which is coming up, will explore that distinction in more detail. So now that we are familiar with normal distributions and bell curves, let's return to standard deviation, and see how that helps us with reading the massage research literature.

##### Back to standard deviation

We discussed earlier that the mean can sometimes be a useful way to summarize and describe the data. But the mean can be so different from the data it summarizes that it does not give an accurate description of it, because some of the data under study is extremely high or extremely low. To put it another way: according to the "Bill Gates Net Worth" web page (yes, there really are some people with that much spare time on their hands; you can find it at http://bgnw.marcus5.net/bgnw.html if you like), at this moment Bill Gates' net worth is $27,600,000,000 (give or take). So, if I told you that on average, Bill Gates and I each have a net worth of $13,800,000,000—did you really learn anything relevant and useful about me (if only!)? Or did you just get a graphic demonstration of how badly the mean fails when it has to deal with extreme values?

Clearly, we need a better tool for describing populations that—like our big, small, and average-sized babies—exhibit a great deal of variation, and the standard deviation (SD) is one of those tools we can use. We won't bother with the mathematics behind the SD here, because for our purposes, I just want you to be able to recognize it when you come across it in the literature, and to understand what it means.

Sometimes you'll see the SD called the mean of the mean [ref]—that refers to the way it is computed mathematically, and also to the way it describes data more accurately than just the mean alone does. Assuming a normal distribution of data (our bell curve), the standard deviation describes where in the bell curve the data lies.
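One common way the research literature expresses "where in the bell curve a value lies" is the z-score: how many SDs the value sits above or below the mean. A minimal sketch, reusing the infant-weight figures quoted earlier in this document (mean 2396.77 g, SD 208.94 g):

```python
def z_score(value, mean, sd):
    """How many standard deviations `value` lies above (+) or below (-) the mean."""
    return (value - mean) / sd

# With mean 2396.77 g and SD 208.94 g, an infant weighing 2605.71 g is
# 1 SD above the mean, and one weighing 1978.89 g is 2 SD below it.
print(round(z_score(2605.71, 2396.77, 208.94), 6))   # 1.0
print(round(z_score(1978.89, 2396.77, 208.94), 6))   # -2.0
```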
And so the normal distribution and standard deviation can deal with extreme data as well as more representative data. Further, a large standard deviation can indicate to the reader that there is something wrong with the data, or with the model, or with both.

First of all, let's put some further meaning on our bell curve. Below we have a bell curve where the different sections are shaded.

Figure 9-4: A bell curve with standard deviations

• The solid gray area, referred to as 1 standard deviation from the mean, represents the largest number of data values. Values that fall in this area of the graph are considered the most normal. We expect 68% of our values to lie somewhere within this range.

• The striped area, referred to as 2 standard deviations from the mean, represents a smaller number of additional, less typical values. We expect 95% of our values to lie within this range (notice that to get from one striped range to the other, we have to go through the gray ranges, so we include that previous 68% in our estimate of 95%).

• The data in the black area, referred to as 3 standard deviations from the mean, represents a small percentage of the values. We expect about 99.7% of our values to lie within this range (notice that to get from one black range to the other, we have to go through the striped ranges and the gray ranges, so we include that 95% in our estimate of about 99.7%).

So now you can begin to see how this addresses the problem with the mean and the extreme values that we've encountered—if we know the mean, and we know how far away (how spread-out) from the mean a particular value of data is, then we have a much more powerful tool for accurately and clearly representing the data than the mean alone is able to provide4. This is useful because it tells us how "spread-out" the population is. The larger the SD, the more widely dispersed the data is, and the more cautious you should be about how well the mean represents the study population. Remember our previous two bell curves?
Figure 9-5: Two different bell curves with varying standard deviations

A false positive error (also called a type I error), for our purposes, exists when it looks like the treatment, such as massage, caused an effect when it really didn't. In other words, its positive result was false. Here is a hypothetical research experiment to see the effect of massage on blood pressure to illustrate how this can happen.

4 So you get a much more accurate description of where I really am if I tell you that on average, Bill Gates and I each have a net worth of $13,800,000,000.00, and that I am more than 3 SD away from that mean. Let's just leave it at that for now. :)


Figure 9-6

In this experiment, the researchers concluded that massage does indeed lower blood pressure. But suppose the researchers made a change in the experimental design: instead of having the control subjects sit in a chair for one hour, they lay down on the massage table for an hour, which caused a different result for the control group.


Figure 9-7

With this experimental design, lying down on the table for one hour, without being massaged, also lowered blood pressure. The conclusion from this experimental design would be that lying on the table, and not the massage itself, lowers blood pressure.

Note that this was not a real experiment and that this may or may not be true. Also note that this is an example of an experiment that any massage therapist can carry out.

A false negative error (also called a type II error) exists when it looks like the treatment, such as massage, had no effect, but it really did.

Here is another hypothetical experiment that demonstrates a false negative error. Suppose the 1-hour massage, and also just lying on the table for an hour, both resulted in no change in blood pressure.


Figure 9-8

However, in this hypothetical experiment, the researcher did not pay attention to a couple of important factors. One was that the massage therapist performing the massages was only available on Monday. And on Monday, there were workmen using jackhammers just outside the window. Also, suppose it is summer and the windows are open. However, when the subjects in the control group came to participate in the study by just lying on the table without getting a massage, it was later in the week and the workmen were gone.

Figure 9-9

If the researcher is unaware of the jackhammer annoyance factor, he will conclude that massage does not lower blood pressure. However, it is possible that the jackhammer noise was having a blood-pressure-elevating effect, which masked the blood-pressure-lowering effect of the massage.

Here is a diagram that illustrates this.


Figure 9-9

#### p-value

My biostatistics professor would have an aneurysm (sorry, Dr. L.!) if he saw how we are going to treat the concept of p-value. And in the bigger picture, he would be right—it is a misunderstood and misused statistical measure, and deserves a fuller and richer treatment by experimenters and statisticians.

On the other hand, the purpose of this book is to give you enough information to read massage research, not to turn you into a specialist in any given area of experimental design. So our strategy will be to understand p-value well enough to use it the way most clinicians do to read research articles, while noting that, in itself, that does not fully do justice to the concept.

#### Sampling

#### Power and sample size

#### Confidence interval and confidence level

In the political season (and at other times, too), we often see poll results reported as a certain set of results, plus or minus a particular margin of error.

Although it’s clear from the context that it means a little uncertainty in the exact results, now that we have discussed the normal distribution, we can understand it in a little more depth.

If the poll accurately reflects the population at large, and if we repeated the poll multiple times, we would expect the results to be about the same, with only a little bit of variation. The amount that it can vary—positive or negative, since it can vary either way—is the margin of error. So if Candidate A is preferred by 68% of the population, and Candidate B by 32%, with a margin of error of +/-5%, that means that either candidate's number could be as much as 5% too high or too low in this poll. So in reality, Candidate A may have anywhere from 63% to 73%, and Candidate B may have anywhere from 27% to 37%. That positive and negative variation around the reported percentage is the margin of error, which leads us into the concept of a confidence interval.

You can think of the confidence interval as a band or a range around the reported value—the “true” number lies somewhere within that band. The confidence level, by contrast, reports how confident we are that the true result lies within that band.
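For a poll proportion, the margin of error at a 95% confidence level is commonly computed with the formula sketched below. This formula is standard polling practice, not something from the text, and the numbers in the example are hypothetical:

```python
import math

def margin_of_error(p, n, z=1.96):
    """Approximate margin of error for a polled proportion.

    p -- reported proportion (e.g. 0.68 for 68%)
    n -- number of people polled
    z -- multiplier from the normal distribution; 1.96 corresponds to a
         95% confidence level (about 2 SD on either side of the mean)
    """
    return z * math.sqrt(p * (1 - p) / n)

# A hypothetical poll of 1000 people: the margin of error is largest
# when the race is closest (p = 0.5), about +/- 3 percentage points.
moe = margin_of_error(0.5, 1000)
print(round(moe * 100, 1))  # 3.1
```

The confidence interval is then the band p ± margin of error, and the confidence level (here 95%) is how confident we are that the true value lies within that band. Note also that the margin shrinks slowly as n grows: quadrupling the sample size only halves the margin of error.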


#### Break

This chapter and the previous one were really dense in terms of the material we covered. But the hardest part of learning about reading research is now over, and if you have stuck with it this far, I promise you that you will find the rest of the book to be smooth sailing in comparison, building readily on what you have already learned.

Since you've worked so hard on the methods and statistics parts, here's another nice bear picture for you to look at, while we take a well-earned break.

\subsection{Average}

The average is an attempt to describe qualities of a group by combining qualities of individual members of the group., The mean, median, and mode describe different ways of averaging, which tell something about the distribution of those individual qualities or values.

\subsubsection{Mean}

Although you may not have heard it referred to by that name, you're already familiar with the concept of mean: it is the kind of average commonly seen in school grading. To get the mean, you add all the results together, and then divide by number of results.

\begin{figure} [ht]

\begin{center}

\epsfxsize 3 in

\epsfbox{9-1.eps}

\end{center}

\caption{\label{9-1}Mean (average) value of 5 final exam grades.}

\end{figure}

% Student Name    Final Exam Grade

%

% 1. John    85

%

% 2. Janet    90

%

% 3. Carmen    75

%

% 4. Michael    65

%

% 5. Sally    70

%

% Total    385

%

% Mean    385÷5 = 77

%

% Figure 9-1: Mean (average) value of 5 final exam grades

\subsubsection{Example from the literature:}

\index{Author---Barlow}

Barlow 2004 investigated whether a single massage would alter the flexibility of the hamstring in physically-active young men, as measured by the value on the sit-and-reach test He included his data in Table 1, so we can calculate the mean of all the sit and reach scores for the subjects (1) before and (2) after the massage by adding all the values in the appropriate column, and then dividing by 11 (the number of subjects in the study):

\begin{figure} [ht]

\begin{center}

\epsfxsize 3 in

\epsfbox{9-2.eps}

\end{center}

\caption{\label{9-2}Sit-and-reach scores in centimeters (Barlow 2004).}

\end{figure}

The disadvantage of the mean is that it can't tell you about extreme values in the data, or how any individual compares to the group, except in the most crudely approximate way. In order to examine this limitation further, let's set up our own table including the mean score (shown in the callouts in table 1 on the previous page). In the last column, observe the difference in score for each subject from the mean score.

The statistics we have covered up until now are useful, but in order to get a clearer picture of what all the data looks like, there are more refined tools we can use to understand the relationships among the values. Standard deviation (SD) is one of those tools

\section{Preparing to discuss standard deviation}

This (standard deviation) is probably the hardest concept we are going to cover. But it is worth it, because of the value of the concept and its applicability to so many different situations. So let's break this up into small pieces to tackle one piece at a time, and see how we can use it, not only in reading massage research, but in many other situations as well.

I remember when a sad event in my childhood brought home to me the concept of a population, although I certainly didn't think about it that way at the time. When I was in fifth grade, a child at my school died. Although I didn't know the child personally, I was sad to hear the news, as was everyone else there. Then I started putting it together with what had happened the year before, when another child had died. I figured out that there must be some kind of rule that every year one child dies at our school, and that next year it could be me. That particular thought was scary enough to keep me up awake for a couple of nights.

Although I was kind of on the right track in certain ways, there were some flaws in my analysis; however, as I was 10 years old at the time, I think I can be forgiven for a certain lack of mathematical rigor. The observation that there was a pattern---the death of one child per year---was a reasonable observation for that very short time span, although if I had been paying attention longer, it is possible that there would have been many other years where no child at that school died.

But from that observation of a pattern, I went a little too far in imaging a rule'' that one child died every year---it would be better to think of it as a \textbf{description} of what \emph{did} happen, rather than as a \textbf{prescription} for what \emph{must} happen. If you think of it in that way, you can see one function that statistics serves---\emph{descriptive statistics} summarizes the data about a population or a study, and \emph{describes} in what way they are similar (central tendency) or different (variability). It takes a very diverse group, and tries to convey concisely and efficiently to the audience what the important measures of that group are. The statistical measures we have gone over up until now---mean, median, mode, and percentile---are descriptive statistics.

\emph{Inferential statistics} takes things a step farther---it lets us use reasoning to \emph{infer}, or make predictions, about the group, based on what we already know. It's what I was dimly sensing when I realized that another could die at my school the next year\footnote{I was, and still am, quite happy to have been proved wrong on that prediction.}, and so came up with my rule''. The statistics we are going to talk about now are inferential statistics, and understanding the concepts of normal distribution, standard deviation, types of error, sample size and power, and inter-observer agreement will make a great deal---even most---of the massage research literature accessible to you.

\section{Standard deviation}

Standard deviation has a lot in common with the averages we discussed earlier, and we will talk about how we can use it as a kind of descriptive statistic. To understand standard deviation, however, we first have to all be on the same page about what \emph{normal distribution} means, so we're going to talk about that first, and then come back to standard deviation.

\subsection{Normal distribution}

We talked earlier in Chapter 2 about how ``normal'' is one of those words that has a specific, neutral meaning in science, yet has very strong connotations in everyday language. It's unfortunate that this word is so heavily loaded, as it is one of the most useful and powerful statistical concepts there is, and serves as a gateway to the world of inferential statistics. The word has been used as a weapon to enforce social and medical agendas---after I have taught a session on massage research and fibromyalgia, I've had people come up to me afterwards and tell me how painful it is to be told they are not ``normal'', where ``normal'' is a \emph{prescriptive} word for how they \emph{should} be. Let's be very clear that this is \textbf{not} how we're using the word. Our specific statistical use of the word is defined below.

First of all, think about a situation you've been in with a lot of other people---a lot of the time, a few people are extreme in some value one way or the other, but most people are pretty close to average. We've all been born, so let's consider the weight at birth of all healthy babies born in the US as our example situation:

\begin{itemize}

\item A few very big babies: 8 1/2 to 9 pounds

\item A few very small babies: 6 to 6 1/2 pounds

\item Most babies: somewhere around 7 or 8 pounds, more or less---called \emph{normal} birthweight, because it forms a \emph{normal distribution}

\end{itemize}

This is what that normal distribution looks like. The curved line is called a bell curve---a pretty descriptive name, because it is indeed shaped like a bell [bell character].

\begin{figure} [ht]

\begin{center}

\epsfxsize 3 in

\epsfbox{9-10.eps}

\end{center}

\caption{\label{9-10}text.}

\end{figure}

Figure 9-2: Bell curve showing normal distribution of birthweights
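If you'd like to watch a bell curve emerge for yourself, here is a small simulation sketch; the mean of 7.5 pounds and the spread of 0.75 pounds are illustrative assumptions, not published figures.

```python
# Simulate birthweights from a normal distribution and tally them into
# half-pound bins; most values pile up near the middle of the range.
import random

random.seed(0)
weights = [random.gauss(7.5, 0.75) for _ in range(10_000)]

bins = {}
for w in weights:
    key = round(w * 2) / 2          # round to the nearest half pound
    bins[key] = bins.get(key, 0) + 1

# Print a crude text histogram: the shape that appears is a bell.
for key in sorted(bins):
    print(f"{key:4.1f} lb | {'#' * (bins[key] // 100)}")
```

Running this prints rows of \texttt{\#} marks that are longest near 7.5 pounds and taper off toward both extremes, just like the curve in the figure.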

While all bell curves have the same basic features of a small ``tail'' at either end (representing a few extreme values) and a large ``bump'' in the middle (representing a lot of typical values), there can still be some dramatic differences in how the data the bell curve represents is arranged. The following are both \emph{\textbf{bell curves}}, but look how different they are from each other:

\begin{figure} [ht]

\begin{center}

\epsfxsize 3 in

\epsfbox{9-11.eps}

\end{center}

\caption{\label{9-11}text.}

\end{figure}

Figure 9-3: Two different bell curves

\begin{itemize}

\item The graph on the left is tall and narrow and drops off sharply.

\item The graph on the right is shorter and drops off much more gently.

\end{itemize}
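The contrast between the two shapes can be made concrete with a quick sketch; the mean, the two spreads, and the group sizes here are all invented. Two groups can share the same mean yet differ sharply in how spread out they are:

```python
# Two simulated groups share a mean of 100 but differ sharply in spread.
import random
import statistics

random.seed(1)
narrow = [random.gauss(100, 5) for _ in range(5_000)]    # tall, narrow curve
wide = [random.gauss(100, 20) for _ in range(5_000)]     # short, gentle curve

print(round(statistics.mean(narrow)), round(statistics.mean(wide)))    # similar means
print(round(statistics.stdev(narrow)), round(statistics.stdev(wide)))  # very different spreads
```

The means come out nearly identical, while the second group's standard deviation is about four times the first's, which is exactly the difference between the steep curve and the gentle one.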

These differences are useful, because they tell something about the data being studied---namely, about how different the extreme values are from the more typical values for that population. The standard deviation, which is coming up, will explore that distinction in more detail. So now that we are familiar with \emph{normal distributions} and \emph{bell curves}, let's return to \emph{standard deviation}, and see how that helps us with reading the massage research literature.

\section{Back to standard deviation}

We discussed earlier that the mean can sometimes be a useful way to summarize and describe data. But when some of the data under study are extremely high or extremely low, the mean can end up so different from the typical values that it no longer describes them accurately. To put it another way: according to the ``Bill Gates Net Worth'' web page\footnote{Yes, there really are some people with that much spare time on their hands. You can find it at: http://bgnw.marcus5.net/bgnw.html if you like.} just now, at this moment, Bill Gates' net worth is \$27,600,000,000 (give or take). So, if I told you that \emph{on average}, Bill Gates and I each have a net worth of \$13,800,000,000---did you \emph{really} learn anything relevant and useful about me\footnote{If only! :)}? Or did you just get a graphic demonstration of how badly the mean fails when it has to deal with extreme values?
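To put rough numbers on that failure, here is a sketch; every net worth except the quoted \$27.6 billion figure is invented. Notice how the median (a measure we met earlier) stays with the typical values, while the mean is dragged far away from everyone in the group:

```python
# One extreme value drags the mean far from every typical value in the group.
import statistics

# Four invented "ordinary" net worths, plus the quoted $27.6 billion figure.
net_worths = [30_000, 45_000, 50_000, 60_000, 27_600_000_000]

print(f"mean:   ${statistics.mean(net_worths):,.0f}")    # mean:   $5,520,037,000
print(f"median: ${statistics.median(net_worths):,.0f}")  # median: $50,000
```

A mean of over five billion dollars describes nobody in this group, while the median lands squarely among the ordinary values.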

Clearly, we need a better tool for describing populations that---like our big, small, and average-sized babies---exhibit a great deal of variation, and the \emph{standard deviation (SD)} is one of those tools we can use. We won't bother with the mathematics behind the SD here, because for our purposes, I just want you to be able to recognize it when you come across it in the literature, and to understand what it means.

Sometimes you'll see the SD called the \emph{mean of the mean} [ref]---that refers to the way it is computed mathematically, and also to the way it describes data more accurately than just the mean alone does. Assuming a \emph{normal distribution} of data (our \emph{bell curve}), the standard deviation describes where in the bell curve the data lies. And so the normal distribution and standard deviation can deal with extreme data as well as more representative data. Further, a large standard deviation can indicate to the reader that there is something wrong with the data, or with the model, or with both.

First of all, let's put some further meaning on our bell curve. Below we have a bell curve where the different sections are shaded.

\begin{figure} [ht]

\begin{center}

\epsfxsize 3 in

\epsfbox{9-12.eps}

\end{center}

\caption{\label{9-12}text.}

\end{figure}

Figure 9-4: A bell curve with standard deviations

\begin{itemize}

\item The solid gray area, referred to as \emph{\textbf{1 standard deviation from the mean}}, represents the largest number of data values. Values that fall in this area of the graph are considered the most normal. We expect 68\% of our values to lie somewhere within this range.

\item The striped area, referred to as \emph{\textbf{2 standard deviations from the mean}}, adds a smaller number of more extreme values. We expect 95\% of our values to lie within this range (notice that to get from one striped range to the other, we have to go through the gray ranges, so we include that previous 68\% in our estimate of 95\%).

\item The data in the black area, referred to as \emph{\textbf{3 standard deviations from the mean}}, represents a small percentage of the values. We expect about 99.7\% of our values to lie within this range (notice that to get from one black range to the other, we have to go through the striped ranges and the gray ranges, so we include that 95\% in our estimate of about 99.7\%).

\end{itemize}
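Those three percentages, sometimes called the 68/95/99.7 rule, can be checked by simulation. The sketch below assumes a normal distribution with an invented mean and standard deviation:

```python
# Check the 68 / 95 / 99.7 rule by simulation: draw values from a normal
# distribution and count how many land within 1, 2, and 3 SD of the mean.
import random

random.seed(42)
mu, sd = 7.5, 0.75                     # invented illustrative figures
data = [random.gauss(mu, sd) for _ in range(100_000)]

shares = {}
for k in (1, 2, 3):
    shares[k] = sum(abs(x - mu) <= k * sd for x in data) / len(data)
    print(f"within {k} SD: {shares[k]:.1%}")   # roughly 68%, 95%, 99.7%
```

The simulated shares come out very close to the theoretical percentages, and they would for any normally distributed measurement, whatever its mean and SD happen to be.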

So now you can begin to see how this addresses the problem with the mean and the extreme values that we've encountered---if we know the mean, \emph{and we know how far away (how spread-out) from the mean a particular value of data is}, then we have a much more powerful tool for accurately and clearly representing the data than the mean alone is able to provide\footnote{So you get a much more accurate description of where I really am if I tell you that \emph{on average}, Bill Gates and I each have a net worth of \$13,800,000,000.00, \emph{and} that I am more than 3 SD away from that mean. Let's just leave it at that for now. :)}. This is useful because it tells us how ``spread-out'' the population is. The larger the SD, the more reason you have to be somewhat skeptical of the study. Remember our previous two bell curves?

\begin{figure} [ht]

\begin{center}

\epsfxsize 3 in

\epsfbox{9-13.eps}

\end{center}

\caption{\label{9-13}text.}

\end{figure}

Figure 9-5: Two different bell curves with varying standard deviations

\section{Types of error}

A \emph{false positive error} (also called a \emph{type I error}), for our purposes, exists when it looks like the treatment, such as massage, caused an effect when it really didn't. In other words, its positive result was false. Here is a hypothetical research experiment on the effect of massage on blood pressure, to illustrate how this can happen.

\begin{figure} [ht]

\begin{center}

\epsfxsize 3 in

\epsfbox{9-14.eps}

\end{center}

\caption{\label{9-14}text.}

\end{figure}

Figure 9-6

In this experiment, the researchers concluded that massage does indeed lower blood pressure. But suppose the researchers made a change in the experimental design: instead of having the control subjects sit in a chair for one hour, they lay down on the massage table for an hour, which caused a different result for the control group.
\begin{figure} [ht]

\begin{center}

\epsfxsize 3 in

\epsfbox{9-15.eps}

\end{center}

\caption{\label{9-15}text.}

\end{figure}

Figure 9-7

With this experimental design, lying down on the table for one hour, without being massaged, also lowered blood pressure. The conclusion from this experimental design would be that lying on the table, and not the massage itself, lowers blood pressure. \textbf{Note that this was not a real experiment and that this may or may not be true.} Also note that this is an example of an experiment that any massage therapist can carry out.

A \emph{false negative error} (also called a \emph{type II error}) exists when it looks like the treatment, such as massage, had no effect, but it really did. Here is another hypothetical experiment that demonstrates a false negative error. Suppose the 1-hour massage and also just lying on the table for an hour both resulted in no change in blood pressure.

\begin{figure} [ht]

\begin{center}

\epsfxsize 3 in

\epsfbox{9-16.eps}

\end{center}

\caption{\label{9-16}text.}

\end{figure}

Figure 9-8

However, in this hypothetical experiment, the researcher did not pay attention to a couple of important factors. One was that the massage therapist performing the massages was only available on Monday. And on Monday, there were workmen using jackhammers just outside the window. Also, suppose it is summer and the windows are open. However, when the subjects in the control group came to participate in the study by just lying on the table without getting a massage, it was later in the week and the workmen were gone.

\begin{figure} [ht]

\begin{center}

\epsfxsize 3 in

\epsfbox{9-17.eps}

\end{center}

\caption{\label{9-17}text.}

\end{figure}

Figure 9-9

If the researcher is unaware of the jackhammer annoyance factor, he will conclude that massage does not lower blood pressure.
However, it is possible that the jackhammer noise was having a blood-pressure-elevating effect which masked the blood-pressure-lowering effect of the massage. Here is a diagram that illustrates this.

\begin{figure} [ht]

\begin{center}

\epsfxsize 3 in

\epsfbox{9-18.eps}

\end{center}

\caption{\label{9-18}text.}

\end{figure}

Figure 9-10

\section{Alpha ($\alpha$)}

The statistical measure $\alpha$ is the probability of making a false positive error. In most of the research literature in which you see $\alpha$, the researcher will tend to set $\alpha$ at about 0.05. That 0.05 means a 5\% risk of making a false positive error. In the example below, that is where Hopper sets his $\alpha$, and he concludes---with a 5\% or less probability of seeing an effect that is not really present---that his intervention (dynamic soft-tissue mobilization) significantly increased hamstring flexibility in the healthy male subjects he studied.

Example:

\begin{figure} [ht]

\begin{center}

\epsfxsize 3 in

\epsfbox{9-19.eps}

\end{center}

\caption{\label{9-19}text.}

\end{figure}

\begin{quotation}
OBJECTIVES: The purpose of this study was to investigate the effect of dynamic soft tissue mobilisation (STM) on hamstring flexibility in healthy male subjects...\textbf{The alpha level was set at 0.05.} RESULTS: Increase in hamstring flexibility was significantly greater in the dynamic STM group than either the control or classic STM groups with mean (standard deviation) increase in degrees in the HFA measures of 4.7 (4.8), -0.04 (4.8), and 1.3 (3.8), respectively. CONCLUSIONS: Dynamic soft tissue mobilisation (STM) significantly increased hamstring flexibility in healthy male subjects. (Hopper 2005) \index{Author---Hopper}
\end{quotation}
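One way to get a feel for what $\alpha = 0.05$ means is to simulate it. The sketch below (all blood pressures, group sizes, and counts are invented) repeatedly compares two groups drawn from the \emph{same} population, so there is no real effect and every ``significant'' result is, by construction, a false positive; they turn up about 5\% of the time:

```python
# Simulate what alpha = 0.05 means: compare two groups drawn from the SAME
# population many times, using a simple z-test on the group means. Since
# there is no real effect, every "significant" result is a false positive.
import random
import statistics

random.seed(7)
trials = 2_000
sigma = 10                                             # known population SD
false_positives = 0

for _ in range(trials):
    a = [random.gauss(120, sigma) for _ in range(30)]  # "massage" group
    b = [random.gauss(120, sigma) for _ in range(30)]  # control group
    diff = statistics.mean(a) - statistics.mean(b)
    se = (sigma**2 / 30 + sigma**2 / 30) ** 0.5        # SE of the difference
    if abs(diff / se) > 1.96:                          # two-sided alpha = 0.05
        false_positives += 1

rate = false_positives / trials
print(f"false positive rate: {rate:.1%}")              # close to 5%
```

Setting $\alpha$ lower (say, 0.01) would shrink that false positive rate, at the cost of making real effects harder to detect.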
(Compare this with our other hamstring study, too.)

\section{\emph{p}-value}

My biostatistics professor would have an aneurysm (sorry, Dr.~L.!) if he saw how we are going to treat the concept of \emph{p}-value. And in the bigger picture, he would be right---it is a misunderstood and misused statistical measure, and deserves a fuller and richer treatment by experimenters and statisticians. On the other hand, the purpose of this book is to give you enough information to read massage research, not to turn you into a specialist in any given area of experimental design. So our strategy will be to understand the \emph{p}-value well enough to use it the way most clinicians do when reading research articles, while noting that this does not, in itself, fully do justice to the concept.

\section{Sampling}

\subsection{Power and sample size}

\subsection{Confidence interval and confidence level}

In the political season (and at other times, too), we often see poll results reported as a certain set of results, \textbf{plus or minus a particular margin of error}. Although it's clear from the context that this means a little uncertainty in the exact results, now that we have discussed the normal distribution, we can understand it in a little more depth. If the poll accurately reflects the population at large, and if we repeated the poll multiple times, we would expect the results to be about the same, with only a little bit of variation. The amount that it can vary---positive or negative, since it can vary either way---is the margin of error. So if Candidate A is preferred by 68\% of the population, and Candidate B by 32\%, with a margin of error of +/- 5\%, that means that either candidate's number could be as much as 5\% too high or too low in this poll.
So in reality, Candidate A may have anywhere from 63\% to 73\%, and Candidate B may have anywhere from 27\% to 37\%. That positive and negative variation around the reported percentage is the margin of error, which leads us into the concept of a \emph{confidence interval}. You can think of the confidence interval as a band or a range around the reported value---the ``true'' number lies somewhere within that band. The confidence level, by contrast, reports how confident we are that the true result lies within that band.

\section{Break}

This chapter and the previous one were really dense in terms of the material we covered. But the hardest part of learning about reading research is now over, and if you have stuck with it this far, I promise you that you will find the rest of the book to be smooth sailing in comparison, building readily on what you have already learned. Since you've worked so hard on the methods and statistics parts, here's another nice bear picture for you to look at, while we take a well-earned break.
\section{Next steps}

Now that we know what the methods and the most important (for our purposes) statistics are, let's move on to look at how study data is reported in the ``Results'' section.
The statistical measure α is the probability of making a false positive error. In most of the research literature which you see α, the researcher will tend to set α at about 0.05. That 0.05 means a 5% risk making a false positive error. In the example below, that is what Hopper sets his α at, and he concludes that, to a 5% or less probability of seeing an effect that is not really present, that his intervention (dynamic soft-tissue mobilization) significantly increased hamstring flexibility in the healthy male subjects he studied. Example: OBJECTIVES: The purpose of this study was to investigate the ef- fect of dynamic soft tissue mobilisation (STM) on hamstring flexibility in healthy male subjects...The alpha level was set at 0.05. RE- SULTS: Increase in hamstring flexibility was significantly greater in the 195 196 Figure 9.19: text. dynamic STM group than either the control or classic STM groups with mean (standard deviation) increase in degrees in the HFA measures of 4.7 (4.8), -0.04 (4.8), and 1.3 (3.8), respectively. CONCLUSIONS: Dy- namic soft tissue mobilisation (STM) significantly increased hamstring flexibility in healthy male subjects. (Hopper 2005) (compare this with our other hamstring study, too). (what other hamstring study?) 9.8 p-value My biostatistics professor would have an aneurysm (sorry, Dr. L.!) if he saw how we are going to treat the concept of p-value. And in the bigger picture, he would 197 be right—it is a misunderstood and misused statistical measure, and deserves a fuller and richer treatment by experimenters and statisticians. On the other hand, the purpose of this book is to give you enough information to read massage research, not to turn you into a specialist in any given area in experimental design. So our strategy will be to understand p-value enough to use it the way most clinicians do to read research articles. and we will note that in itself, that does not fully do justice to the concept. 
9.9 Sampling 9.9.1 Power and sample size Preparing to discuss standard deviation This (standard deviation) is probably the hardest concept we are going to cover. But it is worth it, because of the value of the concept and its applicability to so many different situations. So let’s break this up into small pieces to tackle one piece at a time, and see how we can use it, not only in reading massage research, but in many other situations as well. I remember when a sad event in my childhood brought home to me the concept of a population, although I certainly didn’t think about it that way at the time. 185 186 Figure 9.8: text. When I was in fifth grade, a child at my school died. Although I didn’t know the child personally, I was sad to hear the news, as was everyone else there. Then I started putting it together with what had happened the year before, when another child had died. I figured out that there must be some kind of rule that every year one child dies at our school, and that next year it could be me. That particular thought was scary enough to keep me up awake for a couple of nights. Although I was kind of on the right track in certain ways, there were some flaws in my analysis; however, as I was 10 years old at the time, I think I can be forgiven for a certain lack of mathematical rigor. The observation that there was a pattern—the death of one child per year—was a reasonable observation for that very short time span, although if I had been paying attention longer, it is possible that there would have been many other years where no child at that school died. But from that observation of a pattern, I went a little too far in imaging a “rule” that one child died every year—it would be better to think of it as a description of what did happen, rather than as a prescription for what must happen. If you think of it in that way, you can see one function that statistics serves—descriptive Figure 9.9: text. 
statistics summarizes the data about a population or a study, and describes in what way they are similar (central tendency) or different (variability). It takes a very diverse group, and tries to convey concisely and efficiently to the audience what the important measures of that group are. The statistical measures we have gone over up until now—mean, median, mode, and percentile—are descriptive statistics. Inferential statistics takes things a step farther—it lets us use reasoning to infer, or make predictions, about the group, based on what we already know. It’s what I was dimly sensing when I realized that another could die at my school the next year1, and so came up with my “rule”. The statistics we are going to talk about now are inferential statistics, and understanding the concepts of normal distribution, standard deviation, types of error, sample size and power, 1I was, and still am, quite happy to have been proved wrong on that prediction. 187 188 and inter-observer agreement will make a great deal—even most—of the massage research literature accessible to you. Finally, one more thing about my example, and then we’ll let it go—remember in Chapter 3 when we talked about how science is about what’s common to every- one, while spirituality can be about what is unique and special? I’ve gotten the sense from some of my students, and have felt it myself, that there is something vaguely disturbing about talking about such sad events as a child’s death in terms of a population event, and I suspect that some of the aversion I’ve heard people express to science has something to do with the sense that science somehow sucks out what is special about being human. I would respectfully suggest that the two are not mutually exclusive—it is possible to operate in the two different modes at different times, as appropriate, and in that way to get the best of both—the rigor AND the compassion, as we talked about earlier. AND THAT .... 
9.6 Standard deviation Standard deviation has a lot in common with the averages we discussed earlier, and we will talk about how we can use it as a kind of descriptive statistic. To understand standard deviation, however, we first have to all be on the same page about what normal distribution means, so we’re going to talk about that first, and then come back to standard deviation. 9.6.1 Normal distribution We talked earlier in Chapter 2 about how “normal” is one of those words that has a specific, neutral meaning in science, yet has very strong connotations in everyday language. It’s unfortunate that this word is so heavily loaded, as it is one of the most useful and powerful statistical concepts there is, and serves as a gateway to the world of inferential statistics. The word has been used as a weapon to enforce social and medical agendas—after I have taught a session on massage research and fibromyalgia, I’ve had people some up to me afterwards 189 and tell me how painful it is to be told they are not “normal”, where “normal” is a prescriptive word for how they should be. Let’s be very clear that this is not how we’re using the word. Our specific statistical use of the word is defined below. First of all, think about a situation you’ve been in with a lot of other people—a lot of the time, a few people are extreme in some value one way or the other, but most people are pretty close to average. We’ve all been born, so let’s consider the weight at birth in all healthy babies born in the US as our example situation. A few very big babies: 8 (1/2) to 9 pounds, A few very small babies: 6 to 6 (1/2) pounds Most babies somewhere around 7 or 8 pounds or so, more or less: called normal birthweight because it forms a normal distribution. This is what that normal distribution looks like. The curved line is called a bell curve—a pretty descriptive name, because it is indeed shaped like a bell [bell character]. Figure 9.10: text. 
Figure 9-2: Bell curve showing normal distribution of birthweights While all bell curves have the same basic features of a small “tail” at either end (representing a few extreme values) and a large “bump” in the middle (repre- senting a lot of typical values), there can still be some dramatic differences in how the data the bell curve represents is arranged. The following are both bell curves, but look how different they are from each other: Figure 9-3: Two different bell curves 190 Figure 9.11: text. • The graph on the left is tall and narrow and drops off sharply. • The graph on the right is shorter and drops off much more gently. These differences are useful, because they tell something about the data being studied—namely, about how different the extreme values are from the more typ- ical values for that population. The standard deviation, which is coming up, will explore that distinction in more detail. So now that we are familiar with normal distributions and bell curves, let’s return to standard deviation, and see how that helps us with reading the massage research literature. 9.7 Back to standard deviation We discussed earlier that the mean can sometimes be a useful way to summarize and describe the data. But the mean can be so different from that data that it does not give an accurate description of that data because the data under study is extremely high or extremely low. To put it another way, according to the “Bill Gates Net Worth” web page2 just now, at this moment, Bill Gates’ net worth is$27,600,000,000 (give or take). So, if I told you that on average, Bill Gates and I each have a net worth of $13,800,000,000—did you really learn anything relevant and useful about me3? Or did you just get a graphic demonstration of how badly the mean fails when it has to deal with extreme values? Clearly, we need a better tool for describing populations that—like our big, small, 2Yes, there really are some people with that much spare time on their hands. 
You can find it at: http://bgnw.marcus5.net/bgnw.html if you like. 3If only! :) 191 and average-sized babies—exhibit a great deal of variation, and the standard deviation (SD) is one of those tools we can use. We won’t bother with the mathematics behind the SD here, because for our purposes, I just want you to be able to recognize it when you come across it in the literature, and to understand what it means. Sometimes you’ll see the SD called the mean of the mean [ref]—that refers to the way it is computed mathematically, and also to the way it describes data more accurately than just the mean alone does. Assuming a normal distribution of data (our bell curve), the standard deviation describes where in the bell curve the data lies. And so the normal distribution and standard deviation can deal with extreme data as well as more representative data. Further, a large standard deviation can indicate to the reader that there is something wrong with the data, or with the model, or with both. First of all, let’s put some further meaning on our bell curve. Below we have a bell curve where the different sections are shaded. Figure 9.12: text. Figure 9-4: A bell curve with standard deviations 192 • The solid gray area, referred to as 1 standard deviation from the mean, represents the largest number of data values. Values that fall in this area of the graph are considered the most normal. We expect 68% of our values to lie somewhere within this range. • The striped area, referred to as 2 standard deviations from the mean, represents a larger number of the values. We expect 98% of our values to lie within this range (notice that to get from one striped range to the other, we have to go through the gray ranges, so we include that previous 68% in our estimate of 98%). • The data in the black area, referred to as 3 standard deviations from the mean, represents a small percentage of the values. 
#### Mean

Although you may not have heard it referred to by that name, you're already familiar with the concept of the mean: it is the kind of average commonly seen in school grading. To get the mean, you add all the results together, and then divide by the number of results.

Example from the literature: Barlow 2004 investigated whether a single massage would alter the flexibility of the hamstring in physically active young men, as measured by the value on the sit-and-reach test. He included his data in Table 1, so we can calculate the mean of all the sit-and-reach scores for the subjects (1) before and (2) after the massage by adding all the values in the appropriate column, and then dividing by 11 (the number of subjects in the study).

Figure 9.1: Mean (average) value of 5 final exam grades.

The disadvantage of the mean is that it can't tell you about extreme values in the data, or how any individual compares to the group, except in the most crudely approximate way. To examine this limitation further, let's set up our own table including the mean score (shown in the callouts in Table 1 on the previous page). In the last column, observe the difference in score for each subject from the mean score.

We expect about 99.7% of our values to lie within this range (notice that to get from one black range to the other, we have to go through the striped ranges and the gray ranges, so we include that 95% in our estimate of about 99.7%). So now you can begin to see how this addresses the problem with the mean and the extreme values that we've encountered: if we know the mean, and we know how far away (how spread out) from the mean a particular value of the data is, then we have a much more powerful tool for accurately and clearly representing the data than the mean alone is able to provide.⁴ This is useful because it tells us how "spread out" the population is. The larger the SD, the more spread out the values are, and the more cautious you should be about conclusions drawn from the mean alone. Remember our previous two bell curves?

Figure 9-5: Two different bell curves with varying standard deviations

⁴ So you get a much more accurate description of where I really am if I tell you that, on average, Bill Gates and I each have a net worth of $13,800,000,000.00, and that I am more than 3 SD away from that mean. Let's just leave it at that for now. :)

A false positive error (also called a type I error), for our purposes, exists when it looks like the treatment, such as massage, caused an effect when it really didn't. In other words, the positive result was false. Here is a hypothetical research experiment on the effect of massage on blood pressure to illustrate how this can happen.
Figure 9-6
In this experiment, the researchers concluded that massage does indeed lower blood pressure. But suppose the researchers made a change in the experimental design: instead of having the control subjects sit in a chair for one hour, they had them lie down on the massage table for an hour, which produced a different result for the control group.
Figure 9-7
With this experimental design, lying down on the table for one hour, without being massaged, also lowered blood pressure. The conclusion from this experimental design would be that lying on the table, and not the massage itself, lowers blood pressure. Note that this was not a real experiment, and that this may or may not be true. Also note that this is an example of an experiment that any massage therapist can carry out.
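To make the idea of a false positive concrete, here is a hedged sketch (not from the chapter) of how often a "significant" result appears when a treatment truly does nothing; the blood-pressure numbers, group sizes, and cutoff below are all invented for illustration.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

def looks_significant(n=20):
    # Both groups are drawn from the SAME distribution (mean 120, SD 10),
    # so the "massage" here truly has no effect on blood pressure.
    control = [random.gauss(120, 10) for _ in range(n)]
    treated = [random.gauss(120, 10) for _ in range(n)]
    diff = statistics.mean(treated) - statistics.mean(control)
    # Standard error of the difference between the two group means:
    se = math.sqrt(statistics.variance(control) / n +
                   statistics.variance(treated) / n)
    return abs(diff / se) > 2.0  # roughly the usual p < 0.05 cutoff

# Repeat the no-effect experiment many times and count false positives:
rate = sum(looks_significant() for _ in range(1000)) / 1000
```

With a p < 0.05 cutoff, roughly 1 in 20 of these no-effect experiments comes out looking "significant" by chance alone, which is exactly the type I error described above.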

A false negative error (also called a type II error) exists when it looks like the treatment, such as massage, had no effect, but it really did. Here is another hypothetical experiment that demonstrates a false negative error. Suppose the one-hour massage and also just lying on the table for an hour both resulted in no change in blood pressure.
Figure 9-8

However, in this hypothetical experiment, the researcher did not pay attention to a couple of important factors. One was that the massage therapist performing the massages was only available on Monday, and on Monday there were workmen using jackhammers just outside the window. Also, suppose it is summer and the windows are open. However, when the subjects in the control group came to participate in the study by just lying on the table without getting a massage, it was later in the week and the workmen were gone.

If the researcher is unaware of the jackhammer annoyance factor, he will conclude that massage does not lower blood pressure. However, it is possible that the jackhammer noise was having a blood-pressure-elevating effect which masked the blood-pressure-lowering effect of the massage. Here is a diagram that illustrates this.
Figure 9-9

#### Confidence interval and confidence level

In the political season (and at other times, too), we often see poll results reported as a certain set of results, plus or minus a particular margin of error. Although it's clear from context that this means there is some uncertainty in the exact results, now that we have discussed the normal distribution, we can understand it in a little more depth.
If the poll accurately reflects the population at large, and if we repeated the poll multiple times, we would expect the results to be about the same, with only a little bit of variation. The amount that it can vary (positive or negative, since it can vary either way) is the margin of error. So if Candidate A is preferred by 68% of the population, and Candidate B by 32%, with a margin of error of +/- 5%, that means that either candidate's number could be as much as 5% too high or too low in this poll. So in reality, Candidate A may have anywhere from 63% to 73%, and Candidate B may have anywhere from 27% to 37%. That positive and negative variation around the reported percentage is the margin of error, which leads us into the concept of a confidence interval.
You can think of the confidence interval as a band or a range around the reported value: the "true" number lies somewhere within that band. The confidence level, by contrast, reports how confident we are that the true result lies within that band.
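The poll arithmetic above can be sketched in a few lines; the sample size here is hypothetical, and 1.96 is the standard z-value for a 95% confidence level.

```python
import math

p = 0.68   # reported proportion preferring Candidate A
n = 400    # hypothetical number of people polled
z = 1.96   # z-value corresponding to a 95% confidence level

# Margin of error for a proportion: z * sqrt(p * (1 - p) / n)
margin = z * math.sqrt(p * (1 - p) / n)
low, high = p - margin, p + margin
# (low, high) is the confidence interval; the confidence level (95%)
# says how sure we are that the true proportion falls inside it.
```

With these invented numbers the margin works out to about +/- 4.6 percentage points, so Candidate A's interval runs from roughly 63% to 73%, matching the kind of range described above.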

#### Average

The average is an attempt to describe qualities of a group by combining qualities of its individual members. The mean, median, and mode describe different ways of averaging, each of which tells us something about the distribution of those individual qualities or values.
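As a small illustration of those three kinds of average (the values below are invented), Python's standard library computes each directly:

```python
import statistics

values = [2, 3, 3, 5, 7, 10]

m_mean = statistics.mean(values)      # sum of the values divided by how many there are
m_median = statistics.median(values)  # the middle value once the list is sorted
m_mode = statistics.mode(values)      # the value that occurs most often
```

Here the mean is 5, the median is 4 (the midpoint of the two middle values, 3 and 5), and the mode is 3: each "average" summarizes the same list differently.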

The statistics we have covered up until now are useful, but in order to get a clearer picture of what all the data looks like, there are more refined tools we can use to understand the relationships among the values. Standard deviation (SD) is one of those tools.
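As a quick sketch of why the SD matters (with invented numbers): the two groups below have exactly the same mean, and only the SD reveals how differently they are spread out.

```python
import statistics

narrow = [48, 49, 50, 51, 52]   # tightly clustered scores
wide = [30, 40, 50, 60, 70]     # same mean, far more spread out

same_mean = statistics.mean(narrow) == statistics.mean(wide)  # True: both are 50

sd_narrow = statistics.stdev(narrow)  # about 1.6
sd_wide = statistics.stdev(wide)      # about 15.8
# The mean alone cannot tell these two groups apart; the SD can.
```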

Now that we know what the methods and the most important (for our purposes) statistics are, let's move on to look at how study data is reported in the "Results" section.

On completion of this module, students will be able to:

1. Describe how measurement and statistics can improve our understanding of the basis for massage therapy practice

2. Identify and understand basic concepts such as measurement scales (nominal, ordinal, interval, and ratio scales), range, mean, standard deviation, normal distribution, variable, statistical significance

3. Define the difference between descriptive and inferential statistics

4. Define the difference between parametric and nonparametric statistical tests and identify key examples of each type

5. Name several common ways statistics can be manipulated to change results

1. Name one kind of each of the following: a descriptive study, an experimental study, an observational study.

2. Name three advantages and three disadvantages of the RCT research design.

3. Define descriptive and inferential statistics.

4. Write a 1-Minute Paper on why using statistics is important.

• Types of studies (experimental, correlational) (slide 9)

• Some important descriptive statistical terms (range, mean, standard deviation, normal distribution, variable, dependent variable, independent variable, operational definitions, “n”, “p”, “p value”, hypothesis, null hypothesis) (slides 10-15)

• What to look for in a research paper if you are not a statistician: sample size, power, duration of follow-up, completeness of follow-up (slides 16-20)

• Common pitfalls to avoid when using statistics (Greenhalgh, 2001) (slides 21-25)

• Examples of parametric tests: t-tests, analysis of variance, multiple regression, Pearson’s product moment correlation coefficient (slides 33-38)

• Examples of nonparametric tests: Chi-square, Mann-Whitney U, Spearman’s rank correlation coefficient (slides 39-42)

• Does statistical significance necessarily mean clinical significance? (slide 43)

Sample Research Statistics Evaluation Form

1. Using one of the four articles you found in your electronic literature search in Module 2, fill in the following information:

• RESEARCH STUDY TITLE

• LIST STATISTICAL TESTS USED; IDENTIFY EACH AS PARAMETRIC OR NONPARAMETRIC.

• FOR EACH OF THE IDENTIFIED TESTS, INDICATE WHETHER OR NOT ITS USE WAS APPROPRIATE TO THE DATA COLLECTED.

• WAS THE SAMPLE SIZE OKAY OR NOT OKAY, AND WHY?

• WAS THE DURATION OF FOLLOW-UP OKAY OR NOT OKAY, AND WHY?

• WAS THE COMPLETENESS OF FOLLOW-UP OKAY OR NOT OKAY, AND WHY?

2. For the study you chose, list two questions or potential concerns you have about the statistical analyses the authors used and give your rationale for each question or concern:

Sample Test Questions

1. Outline three assumptions underlying parametric tests. Give an example of a parametric test and describe it.

Answer: Measurements of the dependent variable are at the interval or ratio scale level; measurements approximate a normal distribution curve; variances of the samples compared are roughly equal.

Example: Paired t-test, which compares two sets of observations in a sample, e.g., comparing the weight of infants before and after they eat.
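A minimal sketch of the paired t-test arithmetic, with invented before/after weights (a real analysis would typically use a library routine such as scipy.stats.ttest_rel):

```python
import math
import statistics

before = [3.1, 3.4, 2.9, 3.8, 3.3, 3.6]  # hypothetical weights before eating (kg)
after = [3.3, 3.5, 3.1, 3.9, 3.4, 3.9]   # the same infants after eating

# The paired test works on the per-subject differences:
diffs = [a - b for a, b in zip(after, before)]
mean_diff = statistics.mean(diffs)
se_diff = statistics.stdev(diffs) / math.sqrt(len(diffs))

t = mean_diff / se_diff  # compared against a t distribution with n - 1 df
```

Pairing each subject with itself removes between-subject variation, which is why this design can detect small average changes.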

2. Outline three assumptions underlying nonparametric tests. Give an example of a nonparametric test and describe it.

Answer: Used to measure data at the nominal or ordinal scale level; few assumptions are made about the distribution of the population; addresses ranks, medians or frequencies of data.

Example: Chi-square test, which compares observed frequencies within categories to frequencies expected by chance, e.g., assessing whether acceptance into medical school in the UK is more likely if the candidate was born in the UK.
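The chi-square comparison of observed versus expected counts can be sketched directly; the counts below are invented, echoing the medical-school example.

```python
# Observed counts of accepted applicants, by birthplace (invented):
observed = [30, 10]      # born in the UK, born outside the UK
# Counts we would expect if birthplace made no difference:
expected = [24.0, 16.0]

# Chi-square statistic: sum of (observed - expected)^2 / expected
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
# chi2 is then compared against a chi-square distribution with
# (number of categories - 1) degrees of freedom.
```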

3. Describe three pitfalls that, according to Greenhalgh (2001), should be avoided when using statistics.

1. throwing all your data into a computer and reporting as significant any relationships where “p < 0.05”

2. if baseline differences between the groups favor the intervention group, not adjusting for them

3. not testing your data to see if they are normally distributed

12. Next steps

This chapter and the previous one were really dense in terms of the material we covered, but the hardest part of learning about reading research is now over. If you have stuck with it this far, I promise you that you will find the rest of the book to be relatively smooth sailing in comparison, since it builds on what you have already learned to this point.

13. Evaluate this chapter

14. Figures in this chapter

15. Tables in this chapter

16. Exercises in this chapter

Research Statistics Form (using the four articles each student identified in their literature search in Module 2).

17. Instructor material for this chapter