Cook KF, Bamer AM, Amtmann D, Molton IR, Jensen MP. Six Patient-Reported Outcome Measurement Information System Short Form Measures Have Negligible Age- or Diagnosis-Related Differential Item Functioning in Individuals With Disabilities. Arch Phys Med Rehabil 2012;93:1289-91.

Measure:  Six short-form PROMIS measures: fatigue, pain interference, depression, sleep disturbance, sleep-related impairment, and satisfaction with social roles

Sample:  Convenience sample (n=2479) from the Seattle, Washington area. Four diagnostic groups were sampled from registries, advocacy organizations, and advertisements. The majority of the sample were women (68.1%) and white (94.6%), with relatively high education (86% had a technical degree beyond high school or some college). The largest diagnostic group (48.4%) had multiple sclerosis (MS).

Studied variables:  Diagnosis: MS, muscular dystrophy, post-polio syndrome, and spinal cord injury; age group (<44, 45-54, 55-64, 65+)

DIF method:  Ordinal logistic regression (OLR) with a latent conditioning variable, using lordif.

DIF findings: Diagnosis.  One item was identified with non-uniform DIF for the PROMIS fatigue short form:

-     How often were you too tired to take a bath or shower?

“Diagnosis-specific item parameters were calculated for this item and DIF-corrected scores estimated.  Scores before and after DIF correction were correlated (r>0.999).  Differences between corrected and uncorrected means on the fatigue short form scales were <.02 SD.” (p. 1291)

No age DIF was observed.

Magnitude/Impact of DIF:  Change in pseudo R2 <0.13 and <5% change in the β coefficient, comparing two models: a no-group-effect model (only the conditioning variable, e.g., depression, entered) and a group-effect model (the group variable, e.g., age, also entered). Items with meaningful DIF were treated by developing age- or diagnosis-specific item parameters and rescoring the scales. Post hoc analyses evaluated DIF impact on individual item scores.
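As an illustration of the model comparison described above, the following minimal sketch computes the change in pseudo R2 and the proportional change in the β coefficient between a no-group model and a group-effect model for one item. The function name, the input values, and the rule for combining the two criteria (flag when either is met) are assumptions for illustration; the summary states only the 0.13 and 5% values, and the fitted quantities would come from whatever OLR software is used.

```python
def compare_nested_models(r2_base, r2_group, beta_base, beta_group,
                          r2_threshold=0.13, beta_threshold=0.05):
    """Compare a no-group model with a group-effect model for one item.

    r2_* are pseudo R2 values and beta_* are the coefficients of the
    conditioning variable in each model. Thresholds default to the values
    quoted in the text; combining them with "or" is an assumption.
    """
    r2_change = r2_group - r2_base
    beta_change = abs(beta_group - beta_base) / abs(beta_base)
    flagged = r2_change >= r2_threshold or beta_change >= beta_threshold
    return {"r2_change": r2_change, "beta_change": beta_change, "flagged": flagged}


# Hypothetical fitted values for two items (not taken from the study).
negligible = compare_nested_models(0.412, 0.415, 1.80, 1.83)
meaningful = compare_nested_models(0.412, 0.420, 1.80, 2.00)
print(negligible["flagged"], meaningful["flagged"])
```

Keeping the thresholds as parameters makes it easy to rerun the comparison under stricter or looser criteria.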

Recommendations/Actions taken:  Little DIF. Support was provided for use of the scales in different diagnostic categories and across the age groups studied.  However, recommendations were given for study of more representative samples, particularly with greater ethnic and educational diversity.



 Hahn EA, DeVellis RF, Bode RK, Garcia SF, Castel LD, Eisen SV, Bosworth HB, Heinemann AW, Rothrock N, Cella D, on behalf of the PROMIS Cooperative Group. Measuring social health in the Patient-Reported Outcomes Measurement Information System (PROMIS): item bank development and testing. Qual Life Res 2010;19:1035–1044.


Measure/ Domain/ Item Bank:  Social Health item bank (k=56 items each in two banks). Social Function subdomains: Satisfaction with participation in social roles; Satisfaction with participation in discretionary activities.


Sample: n=956 general population respondents from the Polimetrix Internet polling sample (778 and 768 full-bank respondents were available for analyses of the ability and satisfaction banks, respectively). Mean age: 52 (SD=18); about half were women; 82.5% were white, non-Hispanic, 10% were Hispanic, and 7.6% black; the majority (58%) had family household incomes of $50,000 or more (20% had $100,000 or more).  Only about 12% rated their health as fair or poor.


Variables: Age (<65, 65+), education (high school/GED or less vs. higher education); gender


Short Forms:  Short forms were developed to maximize the range of difficulty by including items across the calibration range that had acceptable discrimination levels.


DIF method:  IRTLRDIF was used to examine uniform and non-uniform DIF.  A model with fully constrained parameters (set equal between groups) was compared to models that allowed parameters to be freely estimated. 
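The nested-model comparison described above is a likelihood-ratio test. The sketch below shows the arithmetic under stated assumptions: the log-likelihood values are hypothetical, the degrees of freedom equal the number of parameters freed between groups, and the chi-square tail probability is computed with a small standard-library helper rather than by IRTLRDIF itself.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: finite closed-form sum.
        total = sum(half ** k / math.factorial(k) for k in range(df // 2))
        return math.exp(-half) * total
    # Odd df: complementary error function plus a finite correction sum.
    tail = math.erfc(math.sqrt(half))
    for k in range(1, (df - 1) // 2 + 1):
        tail += math.exp(-half) * half ** (k - 0.5) / math.gamma(k + 0.5)
    return tail


def lr_dif_test(loglik_constrained, loglik_free, df):
    """G2 statistic and p value for constrained vs. freely estimated models."""
    g2 = -2.0 * (loglik_constrained - loglik_free)
    return g2, chi2_sf(g2, df)


# Hypothetical log-likelihoods; two item parameters freed between groups.
g2, p_value = lr_dif_test(-1520.4, -1517.6, df=2)
print(round(g2, 2), round(p_value, 3))
```

A small p value indicates that freeing the item parameters improves fit, which is the signal for DIF on that item.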


DIF findings:  Note that the Ability to Participate items did not fit the IRT model and this item bank was not evaluated further. For the two Satisfaction with Participation subdomains, no items demonstrated non-uniform DIF. One item exhibited a trivial level of uniform DIF (based on chi-square ranking and p value, p=0.018):


-     I am satisfied with my ability to do things for fun at home (like reading, listening to music, etc).


Magnitude/Impact of DIF: No formal assessment; chi-square ranking and inspection appeared to be used.


Recommendations/Actions taken: Practically no DIF was observed. The authors caution that no clinical samples were included and the sample is not necessarily representative of the U.S. population.



 Revicki DA, Chen W-H, Harnam N, Cook KF, Amtmann D, Callahan LF, Jensen MP, Keefe FJ. Development and psychometric analysis of the PROMIS pain behavior item bank. Pain 2009;146:158–169.


Measure/Domain/Item Bank: PROMIS pain behavior item bank (k=52 items) after content expert review


Sample: Polimetrix web-based data collection from a community sample and from clinical samples with heart disease, cancer, osteo- and rheumatoid arthritis, psychiatric illness, spinal cord injury, and other conditions (n=21,133: 19,601 from the Polimetrix community sample and 1,532 from clinical samples at research sites)

Studied variables: gender, age (<65, 65+), education (high school graduate and above (82-91%) vs. some high school or less).

DIF method:  IRTOLR (ordinal logistic regression), with graded response model (GRM) IRT estimates used for the conditioning variable. MULTILOG and IRTFIT were used for parameter estimation.

DIF findings:  No items with non-uniform or uniform DIF for education; for gender, the following item showed uniform DIF (no direction specified):

-     Ask for help doing things that needed to be done

For age, only uniform DIF was observed for five items:

-     When I was in pain I moved extremely slowly.

-     Pain caused me to bend over while walking.

-     When I was in pain I used a cane or something else for support.

-     When I was in pain I moved my limbs protectively.

-     When I was in pain I clenched my jaw or gritted my teeth.

Conditional on pain behavior, older people were more likely to report problems (the items were easier to endorse for older persons).

Magnitude/Impact of DIF:  Not assessed

Recommendations/Actions Taken:  Select items without evidence of DIF when assessing older populations.




Amtmann D, Cook KF, Jensen MP, Chen W-H, Choi S, Revicki D, Cella D, Rothrock N, Keefe F, Callahan L, Lai J-S. Development of a PROMIS item bank to measure pain interference. Pain 2010;150:173–182.


Measure/Domain/Item Bank:  PROMIS pain interference item bank (k=41 items) after content expert review and removal of items with poor fit


Sample: Polimetrix web-based community and other clinical samples


Studied variables: gender, age (<65, 65+), education (high school graduate and above (82-91%) vs. some high school or less).


DIF method:  IRTOLR with GRM IRT estimates used for the conditioning variable. MULTILOG and IRTFIT were used for parameter estimation.


DIF findings:  No items had non-uniform or uniform DIF for education; for gender, one item showed uniform DIF (no direction specified):

-     How much did pain interfere with your enjoyment of life?

For age, only uniform DIF was observed for eight items (no direction specified):

-     How difficult was it for you to take in new information because of pain?

-     How much did pain interfere with your ability to (a) concentrate? (b) remember things?

-     How often did you feel emotionally tense because of your pain?

-     How often did pain prevent you from (a) walking more than one mile? (b) standing for more than one hour? (c) standing for more than 30 minutes?

-     How irritable did you feel because of pain?


Magnitude/Impact of DIF:  Group-specific item parameters were calculated for items with significant DIF, and scores based on the “corrected item parameters” were compared to those based on the original parameters.  Differences in T-scores (original minus DIF-free estimates) were examined to determine how many individuals had absolute score differences greater than 2 times the median standard error (2.50). The impact of gender DIF was minimal; the impact of age-related DIF was greater.  Of 1276 respondents, 2.9% (n=37) had absolute score differences greater than 2.50.
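The impact check described above amounts to counting respondents whose original and DIF-corrected T-scores differ by more than the threshold. A minimal sketch follows; the function name and the four example score pairs are hypothetical, while the 2.50 threshold and the 37-of-1276 figure come from the summary.

```python
def impact_summary(original, corrected, threshold=2.50):
    """Count respondents whose T-score changes by more than the threshold.

    The difference is original minus DIF-corrected, and the threshold is
    2 times the median standard error (2.50 in the text).
    """
    diffs = [o - c for o, c in zip(original, corrected)]
    n_large = sum(1 for d in diffs if abs(d) > threshold)
    return n_large, n_large / len(diffs)


# Hypothetical T-scores for four respondents.
n_large, share = impact_summary([50.0, 47.5, 62.1, 55.0],
                                [49.2, 50.3, 62.0, 52.4])
print(n_large, share)

# Arithmetic check of the reported figure: 37 of 1276 respondents.
reported_percent = round(100 * 37 / 1276, 1)
print(reported_percent)  # 2.9
```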


Recommendations/Actions Taken:  Select items without evidence of DIF for short forms.  Because DIF impact was considered to be small, the items were not recommended for removal from the bank.





Lai J-S, Cella D, Choi S, Junghaenel DU, Christodoulou C, Gershon R, Stone A. How item banks and their application can influence measurement practice in rehabilitation medicine: a PROMIS fatigue item bank example. Arch Phys Med Rehabil 2011;92(10 Suppl 1):S20-7.


Measure/Item Bank/Domain/subdomain:  Fatigue item bank (fatigue experience and impact) (k=112 original items)


Sample:  The Polimetrix full-bank sample (n=803) was used for dimensionality assessment; the calibration sample (n=14,931) was used for other analyses.  Average age of the 803 respondents: 51.8 (SD=17.8, range 18-89); 55% women; 81% white non-Hispanic; 80% had at least some college.  The full bank sample was similar in demographics.


Studied variables:  Sex and age


DIF method:  OLR and sensitivity analyses using Mantel (extension of Mantel-Haenszel, MH) chi-square test using the stratification group variables. Observed conditioning variable.
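For readers unfamiliar with the Mantel extension of the Mantel-Haenszel test, the sketch below computes the Mantel chi-square (1 degree of freedom) for one ordinal item from response counts stratified on the observed conditioning score. The data layout, function name, and example counts are assumptions for illustration and are not taken from the study.

```python
def mantel_chi2(strata, scores):
    """Mantel chi-square for an ordinal item across matching strata.

    strata: list of (reference_counts, focal_counts) pairs, one pair per
            level of the observed conditioning variable; each element is a
            list of counts per response category.
    scores: numeric score assigned to each response category.
    """
    sum_f = sum_e = sum_v = 0.0
    for ref, foc in strata:
        n_ref, n_foc = sum(ref), sum(foc)
        n = n_ref + n_foc
        if n < 2 or n_ref == 0 or n_foc == 0:
            continue  # a stratum with only one group carries no information
        col = [r + f for r, f in zip(ref, foc)]
        s1 = sum(y * c for y, c in zip(scores, col))
        s2 = sum(y * y * c for y, c in zip(scores, col))
        sum_f += sum(y * f for y, f in zip(scores, foc))  # observed focal sum
        sum_e += n_foc * s1 / n                            # expected focal sum
        sum_v += n_ref * n_foc * (n * s2 - s1 * s1) / (n * n * (n - 1))
    if sum_v == 0:
        return 0.0
    return (sum_f - sum_e) ** 2 / sum_v


# Hypothetical counts for a three-category item.
no_dif = mantel_chi2([([10, 20, 10], [10, 20, 10]),
                      ([5, 5, 10], [5, 5, 10])], scores=[0, 1, 2])
with_dif = mantel_chi2([([20, 10, 5], [5, 10, 20])], scores=[0, 1, 2])
print(no_dif, round(with_dif, 2))
```

The statistic is referred to a chi-square distribution with one degree of freedom; values above 3.84 are significant at the 0.05 level before any multiple-comparison adjustment.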


DIF findings: Four items showed DIF for both sex and age using the calibration sample and OLR.  With OLR, 8 items were identified with age DIF and 10 with the MH methods.  MH identified 6 items with sex DIF.  The content of the items was not provided.


Magnitude/Impact of DIF:  Not reported


Recommendations/Actions taken: Four items with DIF for both age and sex using MH were removed.  



Lai et al., unpublished analyses

Measure/Item Bank/Domain/subdomain: 54 (out of 95) of the fatigue items in the CaPS Fatigue Item Bank

Sample: Cancer and general population samples

Studied variables: Cancer vs. general population

DIF method: Not provided

DIF findings: Six items showed uniform DIF. Cancer patients reported more fatigue:

-     How often did your fatigue make you feel less alert? (FATIMP20)

-     How often were you too tired to socialize with your family? (FATIMP26)

-     How often did you have enough energy to exercise strenuously? (FATIMP40)

-     How often were you too tired to enjoy life? (FATEXP26)

-     How often were you too tired to feel happy? (FATEXP28)

-     How easily did you find yourself getting tired on average? (FATEXP51)

Magnitude/Impact of DIF:  Not reported

Recommendations/Actions taken: Not reported



Rose M, Bjorner JB, Becker J, Fries JF, Ware JE. Evaluation of a preliminary physical function item bank supported the expected advantages of the Patient-Reported Outcomes Measurement Information System (PROMIS). J Clin Epidemiol. 2008;61(1):17-33.


Measure/Item Bank/Domain/subdomain: Physical function item bank (k=163 original items and 70 final items) precursor to the PROMIS physical function item bank


Sample: Cross-sectional data from 7 studies (n=17,726), including ARAMIS and RAC arthritis studies, Medical Outcomes Studies and three general population studies conducted by QualityMetric and the Health Insurance Experiment.  


Studied variables: age, gender, ethnicity, race, sample (arthritis vs. general)


DIF method: Ordinal logistic regression with an observed conditioning variable. A significant age or gender group effect is indicative of uniform DIF, and a significant interaction of group and functional ability is indicative of non-uniform DIF.
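The distinction between the group effect and the group-by-ability interaction can be made concrete with a proportional-odds sketch. All names and coefficient values below are hypothetical; the point is only that a group coefficient shifts the log-odds by a constant amount at every ability level (uniform DIF), while an interaction coefficient makes the shift depend on ability (non-uniform DIF).

```python
import math


def cumulative_probs(ability, group, thresholds, b_ability,
                     b_group=0.0, b_inter=0.0):
    """P(Y >= k) for each threshold under a proportional-odds model."""
    eta = b_ability * ability + b_group * group + b_inter * ability * group
    return [1.0 / (1.0 + math.exp(-(eta - t))) for t in thresholds]


def log_odds_gap(ability, b_group=0.0, b_inter=0.0):
    """Difference in log-odds between group 1 and group 0 at a given ability."""
    return b_group + b_inter * ability


thresholds = [-1.0, 0.0, 1.5]  # hypothetical category thresholds

# No DIF: both groups have identical response probabilities.
same = (cumulative_probs(0.3, 0, thresholds, 1.2)
        == cumulative_probs(0.3, 1, thresholds, 1.2))

# Uniform DIF: the gap is constant across ability levels.
uniform_gaps = [log_odds_gap(a, b_group=0.5) for a in (-1.0, 0.0, 1.0)]

# Non-uniform DIF: the gap changes with ability.
nonuniform_gaps = [log_odds_gap(a, b_group=0.5, b_inter=0.4)
                   for a in (-1.0, 0.0, 1.0)]
print(same, uniform_gaps, nonuniform_gaps)
```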


Magnitude/Impact of DIF: Nagelkerke coefficient of determination from the logistic regression model (the proportion of variation explained by the model); the criterion was a change in R2 >0.033 for combined uniform and non-uniform DIF.
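The Nagelkerke R2 change can be computed directly from model log-likelihoods. The sketch below uses the standard Nagelkerke formula; the log-likelihoods and sample size are hypothetical, and treating the comparison as ability-only versus ability plus group plus interaction is an assumption about how the combined criterion was applied.

```python
import math


def nagelkerke_r2(loglik_null, loglik_model, n):
    """Nagelkerke R2: Cox-Snell R2 rescaled by its maximum attainable value."""
    cox_snell = 1.0 - math.exp(2.0 * (loglik_null - loglik_model) / n)
    max_cox_snell = 1.0 - math.exp(2.0 * loglik_null / n)
    return cox_snell / max_cox_snell


# Hypothetical log-likelihoods for one item, n = 1000 respondents.
n = 1000
ll_null = -1300.0      # intercepts only
ll_ability = -1000.0   # ability only
ll_full = -980.0       # ability + group + ability-by-group interaction

r2_change = (nagelkerke_r2(ll_null, ll_full, n)
             - nagelkerke_r2(ll_null, ll_ability, n))
exceeds_criterion = r2_change > 0.033
print(round(r2_change, 4), exceeds_criterion)
```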


DIF findings: Several items exhibited DIF of high magnitude. (DIF is by sample, e.g., osteoarthritis (OA) vs. rheumatoid arthritis (RA), general population, or Medicare, unless noted.)

ARAMIS study

-     HAQ:  Get in and out of bed

-     HAQ:  Wash and dry body

-     HAQ:  Dress yourself, including shoelaces and buttons

-     HAQ:  Get down a 5 pound object above your head (gender DIF)

-     HAQ:  Shampoo your hair (age DIF)

RA clinical study (DIF by sample)

-     HAQ:  Get in and out of bed

-     HAQ:  Wash and dry your body

-     HAQ:  Get down a 5 pound object above your head

-     HAQ:  Shampoo your hair

-     HAQ:  Dress yourself, including shoelaces and buttons

-     SF-36:  Bathing or dressing yourself


Conditional on functional disability, women in the general population study reported less difficulty doing light housework than men. Older patients with osteoarthritis reported more difficulty shampooing their hair compared with younger patients at the same level of disability.  Women with osteoarthritis were more likely to report problems reaching up to get down a bag of sugar than men at the same level of disability.  This difference was not observed in the rheumatoid arthritis sample.


Other DIF findings were not provided.


Recommendations/Actions taken: Items with DIF were excluded from the bank.


Evaluation of the Equivalence of English- and Spanish-Language Physical Functioning Items

Paz SH, Spritzer KL, Morales LS, Hays RD. Evaluation of the Patient-Reported Outcomes Information System (PROMIS®) Spanish-language physical functioning items. Qual Life Res. 2012 Nov 3. [Epub ahead of print]


Measure/Item Bank/Domain/subdomain: The PROMIS® wave 1 English-language physical functioning bank consists of 124 items and 114 of these were translated into Spanish.


Sample: The items were administered to 640 adult Spanish-speaking Latino members of the Toluna online panel in 2010. The English-speaking sample was from Polimetrix (n=1,504). The Spanish sample had fewer comorbidities, was somewhat less educated (63% with at least some college vs. 80%), and was younger (mean age 37.6 (11.3) vs. 51.1 (18.3)) than the English sample.


Method: IRT threshold and discrimination parameters were estimated using Samejima’s Graded Response Model (using Multilog). DIF by language of administration was evaluated using OLR with IRT-based trait scores estimated from DIF-free anchor items after iterative purification. A pseudo R2 difference of <0.02 between nested models was used to determine anchor items.  Magnitude and individual and aggregate impact were evaluated using expected item and test scores and by examining theta estimates after fixing and freeing parameters based on DIF.


Results: Fifty of the 114 items were flagged for DIF (20 uniform and 30 non-uniform) based on an R-squared of 0.02 or above criterion. The expected physical function total score is higher for Spanish than English language speakers.  Accounting for DIF resulted in higher scores for English speakers.


Recommendations: Comparison of physical functioning scores between English and Spanish language respondents requires using a hybrid calibration for Spanish speakers. The 64 items without DIF are scored using the English parameters (calibrations), and the items with DIF are scored using Spanish calibrations (linearly transformed to the English metric). Future analyses will examine DIF by age group (those 45 and older versus younger respondents). The authors acknowledge that the sample of Latinos may not be representative of the U.S. Latino population.
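The linear transformation mentioned above can be sketched for a graded response item. If the source-metric trait is mapped to the reference metric by theta_ref = slope * theta_src + intercept, discrimination is divided by the slope and thresholds are transformed like theta. The slope, intercept, and item parameters below are hypothetical, and how the authors estimated the linking constants is not described in this summary.

```python
def to_reference_metric(a, thresholds, slope, intercept):
    """Re-express graded response item parameters on the reference metric.

    Assumes theta_ref = slope * theta_src + intercept.
    """
    return a / slope, [slope * b + intercept for b in thresholds]


# Hypothetical Spanish-metric item parameters and linking constants.
a_src, b_src = 1.5, [-1.0, 0.2, 1.4]
slope, intercept = 1.1, -0.3
a_ref, b_ref = to_reference_metric(a_src, b_src, slope, intercept)

# The item's logits are unchanged for a person at theta_src = 0.5.
theta_src = 0.5
theta_ref = slope * theta_src + intercept
logits_src = [a_src * (theta_src - b) for b in b_src]
logits_ref = [a_ref * (theta_ref - b) for b in b_ref]
print(logits_src, logits_ref)
```

Because the logits are preserved, the transformed item produces the same response probabilities on the reference metric as the original item did on its own metric.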





Carle AC, Cella D, Cai L, Choi SW, Crane PK, Curtis SM, Gruhl J, Lai J, Mukherjee S, Reise S, Teresi J, Thissen D, Wu EJ, Hays R. Advancing PROMIS's methodology: Results of the third PROMIS Psychometric Summit. Expert Review of Pharmacoeconomics & Outcomes Research 2011;11: 677-684.


Measure/Item Bank Domain/subdomain:  Physical Function 19-item short form. The PF measure was presented at the third (2010) PROMIS psychometric summit.


Sample: Three waves of data from a study of arthritis were used in the analyses:  baseline (521), 6 months (483), and one year (472).


Studied variables: The measure was examined for DIF by age using four separate methods by four teams of measurement statisticians.


DIF method: Item Response Theory for Patient Reported Outcomes (IRTPRO) was used for testing DIF across multiple groups. A method used in sensitivity analyses was OLR using lordif. A Monte Carlo simulation approach permits empirically derived threshold values, and effect size and impact measures. Multiple Group Confirmatory Factor Analysis (MGCFA) was also used.


DIF findings: The items with DIF were:

-     Does your health now limit you in doing 2 hours of physical labor? (the item was more discriminating for younger people)

-     Does your health now limit you in walking more than one mile?

Given equivalent levels of capability in physical activities, younger people were more likely to claim higher levels of ability to engage in physical labor and walk more than one mile.


Magnitude/Impact of DIF:  Expected item and scale scores were examined for magnitude and impact of DIF as well as NCDIF (for effect size). The item related to walking evidenced DIF of high magnitude for age groups. Moreover, this item was hypothesized by content experts to show DIF and was found in previous literature to evidence DIF.
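As a rough illustration of the magnitude measures named above, the sketch below computes expected item scores under a graded response model and the NCDIF index, taken here as the mean squared difference between focal-parameter and reference-parameter expected item scores over the focal group's trait estimates. The item parameters and trait values are hypothetical.

```python
import math


def grm_expected_score(theta, a, thresholds):
    """Expected item score (0..K-1) under the graded response model.

    The expected score equals the sum of P(Y >= k) over k = 1..K-1.
    """
    return sum(1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds)


def ncdif(focal_thetas, ref_params, focal_params):
    """Mean squared difference in expected item scores over the focal group."""
    squared = [
        (grm_expected_score(t, *focal_params)
         - grm_expected_score(t, *ref_params)) ** 2
        for t in focal_thetas
    ]
    return sum(squared) / len(squared)


# Hypothetical parameters: (discrimination, thresholds) per group.
ref_params = (1.8, [-1.0, 0.0, 1.0])
focal_params = (1.8, [-0.6, 0.4, 1.4])   # thresholds shifted: uniform DIF
focal_thetas = [-1.5, -0.5, 0.0, 0.5, 1.5]

print(ncdif(focal_thetas, ref_params, ref_params))
print(ncdif(focal_thetas, ref_params, focal_params))
```

Summing the expected item scores across items gives the expected scale score, which is the quantity compared between groups when assessing scale-level impact.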


Recommendations/Actions taken: Both items were removed from the short form; the age impact of DIF was then trivial.  Similar analyses are proposed for this study.



Teresi JA, Ocepek-Welikson K, Kleinman M, Eimicke JP, Crane PK, Jones RN, Lai J-S, Choi SW, Hays RD, Reeve BB, Reise SP, Pilkonis PA, Cella D. Analysis of differential item functioning in the depression item bank from the Patient Reported Outcome Measurement Information System (PROMIS): An item response theory approach. Psychol Sci Q 2009;51(2):148-180.

This was the subject of the first psychometric summit in 2007.  Four teams of measurement statisticians examined DIF in the PROMIS depression item bank using different methods.

Measure/Item Bank Domain/subdomain:  32 item depression item bank


Sample:  Polimetrix full bank sample of 379 women and 356 men; 518 lower education and 217 with college or higher; and 201 aged 65 and over and 533 under age 65; sensitivity analyses were performed for those 60 and over.


Studied variables: Age, gender, and education. DIF was examined by four psychometric teams, with other methods used in sensitivity analyses.


DIF method: The IRT log-likelihood ratio test was used for DIF detection;  IRTOLR, DFIT, SIBTEST and MIMIC were used in sensitivity analyses.



Summary of DIF findings for analyses of the PROMIS Depression baseline items.
(* = after adjustment for multiple comparisons, e.g., Benjamini-Hochberg)
Updated June 2011

Item Wording

-     I felt that I had no energy (EDDEP03)

-     I felt worthless (EDDEP04)

-     I felt that I had nothing to look forward to (EDDEP05)

-     I felt helpless (EDDEP06)

-     I withdrew from other people (EDDEP07)

-     I felt that nothing could cheer me up (EDDEP09)

-     I felt that other people did not understand me (EDDEP13)

-     I felt that I was not as good as other people (EDDEP14)

-     I felt like crying (EDDEP16)

-     I felt sad (EDDEP17)

-     I felt that I wanted to give up on everything (EDDEP19)

-     I felt that I was to blame for things (EDDEP21)

-     I felt like a failure (EDDEP22)

-     I had trouble feeling close to people (EDDEP23)
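The summary note above refers to adjustment for multiple comparisons using a procedure such as Benjamini-Hochberg. A minimal sketch of that step-up procedure follows; the p values are hypothetical and the 0.05 false discovery rate is an assumed default, not a value reported in the summary.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the hypothesis is rejected.

    Step-up rule: find the largest rank k with p_(k) <= q * k / m and
    reject the k smallest p values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= q * rank / m:
            cutoff_rank = rank
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[i] = True
    return rejected


# Hypothetical item-level DIF p values.
flags = benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205])
print(flags)
```

Items that remain flagged after this adjustment correspond to the starred entries in a summary of this kind.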



Across all methods, there is consistent gender DIF associated with:

-     I felt like crying

Higher conditional endorsement was observed for women; the item was a more severe indicator of depression for men than for women. This finding was both hypothesized by PROMIS content experts, and found in the literature on DIF in depression measures.


-     I had trouble enjoying the things I used to enjoy

was hypothesized to have higher conditional endorsement in men.  This was confirmed by two analyses. The item was found to have salient DIF in several of the analyses for one or more gender, age and/or education comparisons.


-     I felt that I had no energy

was hypothesized by content experts to possibly show gender and age DIF, and was confirmed by several methods to show age, gender or education DIF.  Conditional on depression, those 65 and over were more likely to report no energy. Those with lower education were more likely to endorse the item. The magnitude of DIF in several of the analyses was not high.


Magnitude/Impact of DIF: DIF detection was accompanied by magnitude measures, such as the non-compensatory DIF (NCDIF) index. Scale-level impact was assessed using expected scale scores, expressed as group differences in the total scale response functions.

Group impact of DIF in the depression item bank was found to be minimal when mean scale or latent trait scores were examined with and without adjustment for DIF. This result was confirmed using the expected scale scores.

Individual impact was observed for about 100 people. For example, using a cutoff of theta ≥ 1, 9.5% would be classified as depressed prior to DIF adjustment, but not after adjustment in the analyses of gender (changes were at least one half S.D. for all individuals); the comparable figures for education and age are 13.6% and 3.4%, respectively. 
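The individual-impact check described above can be sketched as a count of people who meet the cutoff before DIF adjustment but not after. The trait estimates below are hypothetical, and because the summary does not state the denominator behind the reported percentages, the sketch returns the counts along with the share of the whole sample.

```python
def reclassification(theta_before, theta_after, cutoff=1.0):
    """Count people at or above the cutoff before adjustment but below it after."""
    pairs = list(zip(theta_before, theta_after))
    above_before = [p for p in pairs if p[0] >= cutoff]
    dropped = [p for p in above_before if p[1] < cutoff]
    return {
        "n_above_before": len(above_before),
        "n_dropped": len(dropped),
        "share_of_sample": len(dropped) / len(pairs),
    }


# Hypothetical trait estimates before and after DIF adjustment.
result = reclassification([1.2, 0.4, 1.05, 1.6, -0.3],
                          [1.1, 0.5, 0.6, 1.5, -0.2])
print(result)
```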

Recommendations/Actions taken: Based on the results, review of hypotheses generated by content experts and findings from the literature, items with high magnitude DIF were removed from the item bank and from depression short forms.