In November 2011, Arthritis Care & Research published a supplement entitled Patient Outcomes in Rheumatology: A Review of Measures (Volume 63, Issue 11S). This issue contains 35 reviews of over 250 outcome instruments in 4 primary domains:

  1. Pathology and symptoms
  2. Function
  3. Health status and quality of life
  4. Psychological

The ARHP Research Subcommittee developed this page to highlight some of the more commonly used patient outcome measures in rheumatology. The content below is intended to help clinical researchers and clinicians learn about valid and reliable patient-oriented outcome instruments that are useful in the study and management of arthritis and other rheumatic conditions.

Key Definitions

Because various conceptual and operational definitions have been used in studies to document the effects of health care interventions on patient outcomes in arthritis, here are brief definitions of some of the terms that are used in research:

Health: defined by the World Health Organization as "not merely the absence of disease or infirmity" [3] but as a concept that incorporates well-being or wellness in all areas of life (physical, mental, emotional, social, spiritual). Health, according to this definition, is a broad concept incorporating disease, illness, and wellness. When considered as a dimension of quality of life, health is best thought of as the dimension that falls under the purview of health care providers and that health care interventions aim to affect.

Health Status: an individual's relative level of wellness and illness, taking into account the presence of biological or physiological dysfunction, symptoms, and functional impairment.

Health Perceptions/Perceived Health Status: subjective ratings by the affected individual of his or her health status. Some people perceive themselves as healthy despite suffering from arthritis, while others perceive themselves as ill when no objective evidence of disease can be found.

Quality of life: an individual's satisfaction or happiness with life in domains he or she considers important [1]. Also known as "life satisfaction" or "subjective well-being," it is now sometimes referred to as "overall quality of life" or "global quality of life" to distinguish it from "health-related quality of life." It is the broadest of all concepts, influenced by all of the dimensions of life that contribute to its richness and reward, pleasure and pain. These dimensions include, but are not limited to, health. A person's assessment of satisfaction with life involves two subjective considerations:

  1. How important a given domain is for that person.
  2. How satisfied that person is with that domain.

For instance, a person can be dissatisfied with a domain that he or she considers to be of relatively little importance and still report a satisfactory overall quality of life. However, dissatisfaction with a domain of great importance to the individual would clearly contribute to a lower overall quality of life.
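The interplay of importance and satisfaction can be sketched numerically. The example below is only an illustration under stated assumptions: the 1 to 5 rating scales, the domain names, and the importance-weighted average are hypothetical choices and are not taken from any instrument described on this page.

```python
def weighted_life_satisfaction(domains):
    """Importance-weighted mean satisfaction (hypothetical scoring scheme).

    domains maps a domain name to (importance, satisfaction),
    each rated on an assumed 1-5 scale.
    """
    total_importance = sum(importance for importance, _ in domains.values())
    weighted_sum = sum(importance * satisfaction
                       for importance, satisfaction in domains.values())
    return weighted_sum / total_importance


# Both people have the same unweighted mean satisfaction (10 / 3),
# but they are dissatisfied with domains of very different importance.
person_a = {"health": (5, 4), "relations": (5, 5), "recreation": (1, 1)}
person_b = {"health": (5, 1), "relations": (5, 5), "recreation": (1, 4)}

score_a = weighted_life_satisfaction(person_a)  # low rating in an unimportant domain
score_b = weighted_life_satisfaction(person_b)  # low rating in an important domain
print(round(score_a, 2), round(score_b, 2))
```

Person A, who is dissatisfied only with a domain of little personal importance, ends up with the higher weighted score, which mirrors the reasoning in the paragraph above.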

Numerous taxonomies of life domains have been proposed by social, psychological, gerontological, and health sciences researchers based on studies of general populations of both well and ill people. A typical taxonomy is that of Flanagan [2], which categorizes 15 dimensions of life quality into five domains, as shown in the table below.

Table: Flanagan’s Dimensions of Quality of Life

Each domain is followed by its quality of life dimensions.

Physical and material well-being

  1. Material well-being and financial security
  2. Health and personal safety

Relations with other people

  3. Relations with spouse
  4. Having and rearing children
  5. Relations with parents, siblings, or other relatives
  6. Relations with friends

Social, community, civic activities

  7. Helping and encouraging others
  8. Participating in local and governmental affairs

Personal development, fulfillment

  9. Intellectual development
  10. Understanding and planning
  11. Occupational role career
  12. Creativity and personal expression

Recreation

  13. Socializing with others
  14. Passive and observational recreational activities
  15. Participating in active recreation

Functional status: an individual's ability to perform normal daily activities required to meet basic needs, fulfill usual roles, and maintain health and well-being [5,6]. Functional status includes functional capacity and functional performance. It can be influenced by biological or physiological impairment, symptoms, mood, and other factors [6], and it is also likely to be influenced by health perceptions. For example, a person whom most would judge to be well but who views himself or herself as ill may have a low level of functional performance in relation to his or her capacity [5].

Functional capacity: an individual's capacity to perform daily activities in the physical, psychological, social, and spiritual domains of life. Example: a maximal exercise test measures physical functional capacity.

Functional performance: the activities people actually do during the course of their daily lives [5]. Example: a self-report of activities of daily living measures functional performance.

Mood: emotional responses to stressors such as changes in health state. These emotional reactions to life experiences are usually reflected in an individual's affect: the face one presents to the world.

  1. Mood describes a sustained emotional response that, when persistent, can color a person's view of the world.
  2. Depression, anxiety, and anger are emotions that sometimes coexist with physical illness and may affect the individual's functional performance, symptom and health perceptions, and quality of life [6,12,13]. Conversely, decreased functional status may contribute to depressed mood in people with chronic lung disease [12].

Symptoms: patients' perceptions of "an abnormal physical, emotional, or cognitive state" [6].

Measurement Terms – Psychometric Information

Reliability: The extent to which an instrument yields the same results on repeated measures when the underlying construct being measured has not changed.

Inter-rater reliability: Demonstrates the equivalence or agreement among raters who are collecting the same data. Inter-rater reliability is appropriate when the participants in a study are being tested by the researcher. It is not appropriate when research participants are rating their own behavior, perceptions, opinions, or attitudes.
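Agreement between two raters is often summarized with a chance-corrected statistic. The sketch below uses Cohen's kappa, which is one common choice and is not named in the text above; the joint ratings are invented for illustration.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning categories to the same subjects."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must rate the same non-empty set of subjects")
    n = len(rater_a)
    # Proportion of subjects on which the two raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's category frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical category
    return (observed - expected) / (1 - expected)


# Invented example: two examiners classify the same 10 joints.
rater_a = ["swollen", "swollen", "normal", "normal", "swollen",
           "normal", "normal", "swollen", "normal", "normal"]
rater_b = ["swollen", "normal", "normal", "normal", "swollen",
           "normal", "swollen", "swollen", "normal", "normal"]

kappa = cohens_kappa(rater_a, rater_b)
print(round(kappa, 3))
```

Here the raters agree on 8 of 10 joints, but because some of that agreement would be expected by chance, kappa is lower than the raw 80% agreement.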

Intra-rater reliability: Demonstrates the equivalence or agreement within a rater who is collecting data.

Test-retest reliability: Demonstrates the stability or consistency of a measure. It is not appropriate to use when measuring a non-stable construct. To determine stability, a measurement is taken at baseline and then repeated once more after a certain interval of time. The data from the two measurements are compared to give a measure of stability.
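One simple way to compare the two sets of measurements is a correlation coefficient. The sketch below uses Pearson's r as an assumed choice (intraclass correlation is another common option), and the scores are invented.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one list has no variance")
    return sxy / sqrt(sxx * syy)


# Invented scores for six participants whose condition is assumed stable.
baseline = [12, 18, 25, 30, 41, 47]
retest = [14, 17, 27, 29, 40, 49]

test_retest_r = pearson_r(baseline, retest)
print(round(test_retest_r, 3))
```

A coefficient close to 1 suggests that participants keep roughly the same relative standing at both time points, which is what a stable measure of a stable construct should show.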

Equivalency reliability: The extent to which two items measure identical concepts at an identical level of difficulty. Equivalency reliability is determined by relating two sets of test scores to one another to highlight the degree of relationship or association. Equivalency reliability is concerned with correlational, not causal, relationships.

Internal consistency: The extent to which the items of a test measure the same characteristic, skill, or quality. It is a measure of the precision between the observers, or of the measuring instruments used in a study. This type of reliability often helps researchers interpret data and predict the value of scores and the limits of the relationship among variables.
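Internal consistency is commonly reported as Cronbach's alpha. The statistic is not named in the text above, so the sketch below is an assumed illustration, and the three-item questionnaire responses are invented.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha; responses is a list of respondents, each a list of item scores."""
    k = len(responses[0])
    if k < 2 or len(responses) < 2:
        raise ValueError("need at least 2 items and 2 respondents")
    items = list(zip(*responses))  # one tuple of scores per item
    sum_item_variances = sum(variance(item) for item in items)
    total_score_variance = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum_item_variances / total_score_variance)


# Invented responses: 5 respondents, 3 items each scored 1-5.
responses = [
    [3, 3, 4],
    [1, 2, 1],
    [4, 4, 5],
    [2, 2, 2],
    [5, 4, 5],
]

alpha = cronbach_alpha(responses)
print(round(alpha, 3))
```

Because the three invented items rise and fall together across respondents, alpha is high; items that varied independently of one another would pull it toward zero.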

Responsiveness/sensitivity to change: The ability of an instrument to detect meaningful change over time in what is being measured, classically described as responsiveness. It involves two issues:

  1. The measure must detect meaningful change when it has occurred.
  2. The measure must remain stable when no change has occurred.

The ability of a scale to detect change when it has occurred describes the scale's sensitivity to change, whereas the stability of a scale in participants who have not changed represents its specificity to change.
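Both requirements can be checked with a change index computed separately in participants who changed and in participants who did not. The sketch below uses the standardized response mean (mean change divided by the standard deviation of change), an assumed choice that the text above does not name; all scores are invented.

```python
from statistics import mean, stdev


def standardized_response_mean(before, after):
    """Mean change divided by the standard deviation of the individual changes."""
    changes = [post - pre for pre, post in zip(before, after)]
    return mean(changes) / stdev(changes)


# Invented disability scores (lower is better) before and after treatment.
improved_before = [60, 55, 70, 65, 58]
improved_after = [45, 42, 50, 52, 40]

# Invented scores for participants judged clinically unchanged.
stable_before = [60, 55, 70, 65, 58]
stable_after = [61, 54, 71, 64, 58]

srm_improved = standardized_response_mean(improved_before, improved_after)
srm_stable = standardized_response_mean(stable_before, stable_after)
print(round(srm_improved, 2), round(srm_stable, 2))
```

A large absolute value in the improved group together with a value near zero in the stable group is the pattern expected of a measure that detects real change and stays put when nothing has changed.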

Validity: Defined as the accuracy with which a measurement tool measures the concept it is intended to measure. Researchers should be concerned with both internal and external validity.

Internal validity refers to:

  1. The rigor with which a study was conducted (e.g., the study's design, the care taken to conduct measurements, and decisions concerning what was and was not measured).
  2. The extent to which the designers of a study have taken into account alternative explanations for any causal relationships they explored.

In studies that did not explore causal relationships, only the first of these definitions should be considered when assessing internal validity.

External validity refers to the extent to which the results of a study are generalizable.

Content validity: Based on the extent to which a measurement reflects the specific intended domain of content. For example, a researcher needing to measure an attitude like self-esteem must decide what constitutes a relevant domain of content for that attitude. For socio-cultural studies, content validity forces the researchers to define the very domains they are attempting to study.

Face validity: Concerned with how a measure appears. Does it seem like a reasonable way to gain the information the researcher is attempting to obtain? Does it appear well-designed? Does it appear as though it will work reliably? Unlike content validity, face validity does not depend on established theories for support.

Criterion-related validity: Used to demonstrate the accuracy of a measure by comparing it with another measure that has been demonstrated to be both reliable and valid.

For example, suppose that a hands-on driving test has been shown to be a reliable and valid measure of driving skills. A written driving test can then be validated with a criterion-related strategy by comparing its scores with the scores from the hands-on test, which serves as the criterion.

Construct validity: Seeks agreement between a theoretical concept and a specific measuring instrument. For example, a researcher inventing a new IQ test might spend a great deal of time attempting to theoretically "define" intelligence in order to reach an acceptable level of construct validity.

Construct validity can be broken down into two sub-categories:

  1. Convergent validity is the actual general agreement among ratings, gathered independently of one another, where measures should be theoretically related.
  2. Divergent validity is the lack of a relationship among measures which theoretically should not be related.
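The two sub-categories above can be illustrated by correlating a new measure with one measure it should relate to and one it should not. This is a minimal sketch with invented data; the variable names and the use of Pearson's r are assumptions made for illustration.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Invented scores for six participants.
new_pain_scale = [2, 4, 5, 7, 8, 9]           # instrument being validated
established_pain_scale = [1, 3, 5, 6, 8, 10]  # theoretically related measure
shoe_size = [9, 7, 10, 8, 7, 9]               # theoretically unrelated measure

convergent_r = pearson_r(new_pain_scale, established_pain_scale)
divergent_r = pearson_r(new_pain_scale, shoe_size)
print(round(convergent_r, 2), round(divergent_r, 2))
```

Evidence for construct validity would be a strong correlation with the related measure and a weak one with the unrelated measure, as in this invented data.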

