Measurement and Methods Reference List:

Intermediate & Advanced Readings

I.       Measurement


A.     Statistical


            1.  Classical Test Theory


Lord, F.M. & Novick, M.R. (1968). Statistical Theories of Mental Test Scores (with contributions by A. Birnbaum). Reading, MA: Addison-Wesley.


This is the standard text on classical psychometric theory used to teach measurement statistics to statisticians. The foundations of reliability theory are presented and formulas fully derived. The introduction to item response theory in Part 5 (chapters 17-20), by Allan Birnbaum, is a classic. This book is targeted to the intermediate to advanced student.


Nunnally, J.C. (1978). Psychometric Theory, Second Edition. New York: McGraw-Hill.


A superb introduction to measurement concepts. The math proficiency required is a solid grasp of algebra; calculus is not required. Because of the way it is written, it is most suitable for psychologists and other social scientists. Persons without social science training and people with medical or biological interests often have trouble translating the concepts from the psychological examples to medical ones.


DeVellis, R.F. (1991). Scale Development: Theory and Applications. Newbury Park: Sage.


This book is helpful both for scale construction and for the evaluation of the reliability and validity of scales. It is, I think, a step easier than Nunnally. It works well for people with medical interests because of the nature of the conceptual approach it takes. It does not include reliability assessment for discrete data (kappa) or judge reliability via intraclass correlation coefficients. This book is useful for teaching measurement to epidemiologists and results in fewer complaints than when Nunnally is assigned.


Dunn, G. (1989). Design and Analysis of Reliability Studies: The Statistical Evaluation of Measurement Errors. New York: Oxford University Press.


Crocker, L. & Algina, J. (1986). Introduction to Classical and Modern Test Theory. Orlando: Harcourt Brace Jovanovich.


            2.  Interrater Reliability


Muller, R. & Buttner, P. (1994). A critical discussion of intraclass correlation coefficients. Statistics in Medicine, 13, 2465-2476.


Bodian, C.A. (1994). Intraclass correlation for two-by-two tables under three sampling designs. Biometrics, 50, 183-193.


Kraemer, H.C. & Bloch (1988). Kappa coefficients in epidemiology: An appraisal of a reappraisal. Journal of Clinical Epidemiology, 41, 959-968.


Feinstein, A.R. & Cicchetti, D.V. (1990). High agreement but low kappa: I. The problems of two paradoxes. Journal of Clinical Epidemiology, 43, 543-549.


Cicchetti, D.V. & Feinstein, A.R. (1990). High agreement but low kappa: II. Resolving the paradoxes. Journal of Clinical Epidemiology, 43, 551-558.
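
The situation named in the titles of the two papers above can be illustrated with a small sketch. The counts below are hypothetical and are not taken from either paper; the function simply applies the usual definition of Cohen's kappa to a 2x2 agreement table. Both tables have the same observed agreement, but the one with very unbalanced marginals yields a kappa near zero.

```python
def cohen_kappa_2x2(a, b, c, d):
    """Observed agreement and Cohen's kappa for a 2x2 rater-by-rater table.

    a = both raters say yes, b = rater 1 yes / rater 2 no,
    c = rater 1 no / rater 2 yes, d = both raters say no.
    """
    n = a + b + c + d
    observed = (a + d) / n
    # Chance agreement expected from the two raters' marginal proportions.
    chance_yes = ((a + b) / n) * ((a + c) / n)
    chance_no = ((c + d) / n) * ((b + d) / n)
    expected = chance_yes + chance_no
    kappa = (observed - expected) / (1.0 - expected)
    return observed, kappa


# Hypothetical counts chosen only for illustration.
balanced = cohen_kappa_2x2(45, 5, 5, 45)  # marginals near 50/50
skewed = cohen_kappa_2x2(90, 5, 5, 0)     # marginals near 95/5

print("balanced: agreement=%.3f kappa=%.3f" % balanced)
print("skewed:   agreement=%.3f kappa=%.3f" % skewed)
```

In this sketch the two tables differ only in how the cases are distributed across categories, which is why readers are usually advised to report the marginal proportions alongside kappa.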


Fleiss, J.L. & Cohen, J. (1973). The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educational and Psychological Measurement, 33, 613-619.


Bartko, J. (1994). General methodology II. Measures of agreement: A single procedure. Statistics in Medicine, 13, 737-745.


Bravo, G. & Potvin, L. (1991). Estimating the reliability of continuous measures with Cronbach’s alpha or the intraclass correlation coefficient: Toward the integration of two traditions. Journal of Clinical Epidemiology, 44, 381-390.


Fisher, A. (1993). The assessment of IADL motor skills: An application of many-faceted Rasch Analysis. The American Journal of Occupational Therapy, 47, 319-329.


Fisher, A. (1993). Measurement-related problems in functional assessment. The American Journal of Occupational Therapy, 47, 331-338.


Lunz, M. & Stahl, J. (1993). The effect of rater severity on person ability measure: A Rasch Model analysis. The American Journal of Occupational Therapy, 47, 311-317.

            3.  Structural Equation Modeling: Measurement Model


Cohen, P., Cohen, J., Teresi, J., Marchi, P. & Velez, N. (1990). Problems in the measurement of latent variables in structural equation models. Applied Psychological Measurement, 14, 183-186.


This article discusses some of the problems in the application of structural equation modeling, with a focus on the impact of the correction for attenuation on the structural equation estimates. The article makes recommendations regarding applications of the measurement component of the model.
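
For readers unfamiliar with the correction for attenuation mentioned above, the following is a minimal sketch of the classical (Spearman) formula, in which an observed correlation is divided by the square root of the product of the two reliabilities. The function name and the numbers are hypothetical; the article itself concerns the analogous adjustment made within structural equation models, which this sketch does not reproduce.

```python
import math


def disattenuated_correlation(r_xy, rel_x, rel_y):
    """Classical correction for attenuation: r_xy / sqrt(rel_x * rel_y).

    r_xy  : observed correlation between two measures
    rel_x : reliability of the first measure (0 < rel_x <= 1)
    rel_y : reliability of the second measure (0 < rel_y <= 1)
    """
    if not (0.0 < rel_x <= 1.0 and 0.0 < rel_y <= 1.0):
        raise ValueError("reliabilities must lie in (0, 1]")
    return r_xy / math.sqrt(rel_x * rel_y)


# Hypothetical values: a modest observed correlation and two
# imperfectly reliable scales.
observed = 0.30
corrected = disattenuated_correlation(observed, rel_x=0.70, rel_y=0.80)
print("observed r = %.3f, corrected r = %.3f" % (observed, corrected))
```

The corrected value is always at least as large in magnitude as the observed one, and it grows quickly as the reliabilities fall, so errors in the reliability estimates carry directly into the corrected estimate.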


Meredith, W. (1993). Measurement invariance, factor analysis and factorial invariance. Psychometrika, 58, 525-543.


Nesselroade, J. & Thompson, W. (1995). Selection and related threats to group comparisons: An example comparing factorial structures of higher and lower ability groups of adult twins. Psychological Bulletin, 117, 271-284.


            4.  Modern Psychometric Theory


Lord, F.M. (1980). Applications of Item Response Theory to Practical Testing Problems. Hillsdale: Lawrence Erlbaum.


The first complete presentation of item response theory (IRT) models other than the Rasch model, this book presents the derivation of the two- and three-parameter models, with applications and an introduction to item bias analyses (now called Differential Item Functioning).


van der Linden, W.J. & Hambleton, R.K. (1995). Handbook of Modern Item Response Theory. New York: Springer.


This book is probably a standard for exposition of modern psychometric theory, although the field is changing so rapidly that it is necessary to read the statistical journal articles to keep abreast of new developments. In addition to a good introductory chapter reviewing the history of IRT, this book has several chapters dealing with multidimensional models. This is a must-read for advanced students of IRT.


Hambleton, R.K., Swaminathan, H. & Rogers, H.J. (1991). Fundamentals of Item Response Theory. Newbury Park, California: Sage Publications, Inc.


This book is an excellent primer on IRT and simpler than most work in the measurement statistical literature. The basics of IRT, including different methods of estimation and a review of different computer software, are presented.


Teresi, J.A., Kleinman, M. & Ocepek-Welikson, K. (in press). Modern psychometric methods for detection of differential item functioning: Application to cognitive assessment measures. Statistics in Medicine.


This paper is an overview of DIF detection and IRT for non-measurement statisticians and methodologists. This review provides the rationale for IRT methods in the context of statistical invariance and pulls together in one place the many different DIF tests currently in use.


            4a.  Differential Item Functioning


Holland, P.W. & Wainer, H. (1993). Differential Item Functioning. Hillsdale, New Jersey: Lawrence Erlbaum Associates.


This is an excellent compilation of material on differential item functioning. The chapter comparing the results of different programs is particularly helpful. Again, unfortunately, the book is already out of date because of the outpouring of articles on DIF in the statistical measurement literature. Nonetheless, it is a very good overview.


Camilli, G. & Shepard, L.A. (1994). Methods for Identifying Biased Test Items, Volume 4. Thousand Oaks, California: Sage Publications.


This is a good introductory-level book on item bias and differential item functioning (DIF). There is a section describing the difference between bias and DIF. Both parametric (model-based) and non-parametric methods of detecting DIF are provided.


Teresi, J.A., Golden, R.R., Cross, P.S., Gurland, B.J., Kleinman, M. & Wilder, D.E. (1995). Item bias in cognitive screening measures: Comparisons of elderly white, Afro-American, Hispanic and high and low education subgroups. Journal of Clinical Epidemiology, 48, 473-483.


Raju, N.S. (1988). The area between two item characteristic curves. Psychometrika, 53, 495-502.


The author presents formulas for computing the exact signed and unsigned areas between two item characteristic curves (ICCs) for the one-parameter model, the two-parameter model, and for the three-parameter model with equal c parameters.
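
As a rough illustration of what these area measures quantify, the sketch below computes the signed and unsigned areas between two logistic ICCs with a common lower asymptote c, both by closed-form expressions and by brute-force numerical integration as a check. The closed forms are written out here from the logistic model rather than copied from the article, the scaling constant D = 1.7 is an assumption, and the item parameters are hypothetical.

```python
import math

D = 1.7  # assumed logistic scaling constant


def icc(theta, a, b, c):
    """Three-parameter logistic item characteristic curve."""
    z = D * a * (theta - b)
    # Evaluate the logistic function without overflow for large |z|.
    if z >= 0:
        p = 1.0 / (1.0 + math.exp(-z))
    else:
        ez = math.exp(z)
        p = ez / (1.0 + ez)
    return c + (1.0 - c) * p


def softplus(z):
    """Stable evaluation of ln(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def signed_area(a1, b1, a2, b2, c):
    """Integral of ICC1 - ICC2 over theta (equal c required)."""
    return (1.0 - c) * (b2 - b1)


def unsigned_area(a1, b1, a2, b2, c):
    """Integral of |ICC1 - ICC2| over theta (equal c required)."""
    if a1 == a2:
        return (1.0 - c) * abs(b2 - b1)
    z = D * a1 * a2 * (b2 - b1) / (a2 - a1)
    term = 2.0 * (a2 - a1) / (D * a1 * a2) * softplus(z)
    return (1.0 - c) * abs(term - (b2 - b1))


def numeric_areas(a1, b1, a2, b2, c, lo=-40.0, hi=40.0, steps=80000):
    """Trapezoid-rule check of both areas on a wide theta grid."""
    h = (hi - lo) / steps
    signed = 0.0
    unsigned = 0.0
    for i in range(steps + 1):
        theta = lo + i * h
        diff = icc(theta, a1, b1, c) - icc(theta, a2, b2, c)
        weight = 0.5 if i in (0, steps) else 1.0
        signed += weight * diff
        unsigned += weight * abs(diff)
    return signed * h, unsigned * h


# Hypothetical parameters for the same item in two groups.
params = dict(a1=1.0, b1=0.0, a2=1.5, b2=0.5, c=0.2)
closed = (signed_area(**params), unsigned_area(**params))
numeric = numeric_areas(**params)
print("closed form:", closed)
print("numerical:  ", numeric)
```

When the two curves cross, as they do whenever the discriminations differ, the unsigned area exceeds the absolute signed area because positive and negative differences no longer cancel.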


Raju, N.S. (1990). Determining the significance of estimated signed and unsigned areas between two item response functions. Applied Psychological Measurement, 14, 197-207.


The author presents formulas for the means and variances of estimated signed and unsigned areas between two item response functions for the one-parameter model, the two-parameter model, and for the three-parameter model with equal c parameters. These are then used to develop significance tests for the signed and unsigned areas.


Budgell, G.R., Raju, N.S. & Quartetti, D.A. (1995). Analysis of differential item functioning in translated instruments. Applied Psychological Measurement, 19(4), 309-321.


The authors compare three IRT-based methods (Lord’s chi-square test, Raju’s signed area test, and Raju’s unsigned area test) and the Mantel-Haenszel procedure for detecting differential item functioning. They apply these to a numerical test and a reasoning test that were developed first in English and then translated into French. English-speaking examinees are compared to French-speaking examinees.
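
For orientation, the Mantel-Haenszel procedure named above can be sketched as follows. Examinees are stratified by total score, a 2x2 table (group by correct/incorrect) is formed within each stratum, and a common odds ratio is pooled across strata; the conversion to the delta scale uses the conventional multiplier of -2.35. The counts are hypothetical and are not data from the article, and the chi-square significance test that usually accompanies the estimate is omitted.

```python
import math


def mantel_haenszel_dif(strata):
    """Pooled Mantel-Haenszel odds ratio and delta-scale DIF index.

    strata: iterable of (A, B, C, D) counts, one tuple per score level:
        A = reference group correct,  B = reference group incorrect,
        C = focal group correct,      D = focal group incorrect.
    """
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n == 0:
            continue  # an empty score level contributes nothing
        numerator += a * d / n
        denominator += b * c / n
    alpha = numerator / denominator
    # Negative delta means the item is harder for the focal group.
    delta = -2.35 * math.log(alpha)
    return alpha, delta


# Hypothetical counts at three matched score levels.
item_with_dif = [(30, 20, 20, 30), (40, 10, 30, 20), (45, 5, 40, 10)]
item_without_dif = [(30, 20, 30, 20), (40, 10, 40, 10), (45, 5, 45, 5)]

alpha_dif, delta_dif = mantel_haenszel_dif(item_with_dif)
alpha_none, delta_none = mantel_haenszel_dif(item_without_dif)
print("DIF item:    alpha=%.3f delta=%.3f" % (alpha_dif, delta_dif))
print("no-DIF item: alpha=%.3f delta=%.3f" % (alpha_none, delta_none))
```

Because a single odds ratio is pooled over all score levels, this statistic is aimed at uniform DIF; differences that change direction across the score range can cancel in the pooled estimate.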


Raju, N.S., van der Linden, W.J. & Fleer, P.F. (1995). IRT-based internal measures of differential functioning of items and tests. Applied Psychological Measurement, 19, 353-368.


Two new methods for detecting DIF are introduced: the compensatory DIF index and the noncompensatory DIF index. These are then compared with Lord’s chi-square test, Raju’s signed area test, and Raju’s unsigned area test.

Narayanan, P. & Swaminathan, H. (1996). Identification of items that show nonuniform DIF. Applied Psychological Measurement, 20, 257-274.


This is a comparison of three procedures for detecting nonuniform DIF: the Mantel-Haenszel procedure, the simultaneous item bias test (SIB), and logistic regression. The authors compare Type I error rates and power.


            4b.  Software Applications


Joreskog, K. & Sorbom, D. (1981). LISREL: Analysis of Linear Structural Relationships by Maximum Likelihood and Least Squares Methods. Uppsala, Sweden: University of Uppsala.


Thissen, D. (1991). MULTILOG™ User's Guide: Multiple Categorical Item Analysis and Test Scoring Using Item Response Theory, Version 6.0. Chicago, Illinois: Scientific Software.


Wood, R.L., Wingersky, M.S. & Lord, F.M. (1976). LOGIST: A Computer Program for Estimating Examinee Ability and Item Characteristic Curve Parameters. Princeton, NJ: Educational Testing Service.


Wright, B.D. & Mead (1977). BICAL: Calibrating Items and Scales with the Rasch Model (Research Memorandum 23). Chicago, Illinois: Statistical Laboratory, Department of Education, University of Chicago.


Zimowski, M.F., Muraki, E., Mislevy, R.J. & Bock, R.D. BIMAIN™ 2. Chicago, Illinois: Scientific Software International.


            4b.1.  Miscellaneous


Trochim, W. (1989). An introduction to concept mapping for planning and evaluation. Evaluation and Program Planning, 12, 1-16.


B.     Substantive


            1.  Gerontological and Geriatric Assessment


Lawton, M.P. & Teresi, J. (1994). Annual Review of Gerontology and Geriatrics: Assessment Techniques, Volume 14. New York: Springer Press.


This volume contains sixteen reviews evaluating a range of assessment measures with regard to older adults. It addresses topics such as assessment of health, functional disability (ADLs), mental ability, aging and personality, depression, and pain. This book discusses the suitability, strengths, and weaknesses of various measures and offers current information on the state of the art in assessment technology.


Teresi, J., Lawton, M.P., Holmes, D. & Ory, M. (Eds.). Measurement in elderly chronic care populations. Journal of Mental Health and Aging 1996, Volume 2(3); 1997, Volume 3(1).


This two-volume series contains 15 reviews by experts in each of the following areas: cognition, communication, vision, behavior, personality, depression, affect, comorbidity, ADL, social support, formal health care services, environments, health-related quality of life, personal care inputs, and costs.


Teresi, J., Lawton, M.P., Ory, M. & Holmes, D. (1994). Measurement issues in chronic care populations: Dementia special care. In Holmes, D., Ory, M. & Teresi, J. (Eds.) Alzheimer Disease and Associated Disorders, 8, S144-S183.


Andresen, E., Rothenberg, B. & Zimmer, J.G. (Eds.) (1997). Assessing the Health Status of Older Adults. New York: Springer.


            2.  Measurement in Diverse Populations


The papers below provide excellent, comprehensive reviews of the conceptualization and measurement of social class and socioeconomic position in epidemiologic studies:


Krieger, N., Williams, D.R. & Moss, N.E. (1997). Measuring social class in U.S. public health research: Concepts, methodologies and guidelines. Ann Rev Public Health, 18, 341-378.


Krieger, N., Rowley, D.L., Herman, A. et al. (1993). Racism, sexism and social class: Implications for studies of health, disease, and well-being. Am J Prev Med, 9, 82-122.


The papers below present an analysis and critique of the use and measurement of the “race” category in epidemiology and public health research:


Williams, D.R. (1997). Race and health: Basic questions, emerging directions. Ann Epidemiol, 7, 322-333.


LaVeist, T.A. (1994). Beyond dummy variables and sample selection: What health services researchers ought to know about race as a variable. Health Serv Res, 29, 1-16.


Cooper, R.S. (1994). A case study in the use of race and ethnicity in public health surveillance. Public Health Rep, 109, 46-52.


Muntaner, C., Nieto, F.J. & O’Campo, P. (1996). The bell curve: On race, social class and epidemiologic research. Am J Epidemiol, 144, 531-536.


Cooper, R. (1984). A note on the biologic concept of race and its application in epidemiologic research. Am Heart J, 108, 715-723.


            3.  Miscellaneous (e.g., Medical Outcomes)


Kane, R. (1997). Understanding Health Care Outcomes Research. Gaithersburg, Maryland: Aspen Publishers.


This book provides an introduction to the utility of outcome data in the process of making clinical decisions. It is a resource to health services researchers and clinicians who utilize patient-centered outcome information in the context of care.


II.     Methods for Analysis of Data


A.     General


Collins, L.M. & Horn, J.L. (1991). Best Methods for the Analysis of Change: Recent Advances, Unanswered Questions, Future Directions. Washington, DC: American Psychological Association.


Cook, T. & Campbell, D. (1979). Quasi-Experimentation: Design & Analysis Issues for Field Settings. Chicago: Rand McNally.


This is a marvelous book, jam-packed with important concepts that all researchers should know about. In particular, it provides a scheme for evaluating the validity of a causal inference. The scheme involves four major considerations: Statistical Conclusion Validity, Internal Validity, Construct Validity, and External Validity. It provides lists of threats to each type of validity. As such, it provides an impressive compendium of ways in which one can go wrong in asserting that one thing causes another. People not trained in the social sciences have trouble with the language, but the concepts are definitely worth the struggle.

Schaie, K.W., Campbell, R.T., Meredith, W. & Rawlings, S.C. (1988). Methodological Issues in Aging Research. New York: Springer Publishing.


Lawton, M.P. & Herzog, R. (1989). Special Research Methods for Gerontology. Amityville, New York: Baywood Publishers.


Williams, R.H. & Zimmerman, D.W. (1996). Are simple gain scores obsolete? Applied Psychological Measurement, 20, 59-69.

Collins, L.M. (1996). Is reliability obsolete? A commentary on "Are Simple Gain Scores Obsolete?" Applied Psychological Measurement, 20, 289-292.


Humphreys, L.G. (1996). Linear dependence of gain scores on their components imposes constraints on their use and interpretation: Comment on "Are Simple Gain Scores Obsolete?" Applied Psychological Measurement, 20, 293-294.


Liu, X., Teresi, J.A. & Waternaux, C. (in press). Modeling the decline pattern in functional measures from a prevalence cohort study. Statistics in Medicine.


This paper describes two methods of examining longitudinal change in chronic care populations. One is a variant of a random effects model, while the other is an adjacent change model. The adjacent change model may be superior in terms of conditioning on group inequalities when using a non-equivalent comparison group design.


Teresi, J. (1994). Overview of methodological issues in the study of chronic care populations. Alzheimer Disease and Associated Disorders, 8, S247-S273.


Rogosa, D. (1994). Commentary: Individual trajectories as the starting points for longitudinal data analysis. Alzheimer Disease and Associated Disorders, 8, S302-S307.


Nesselroade, J. (1994). Longitudinal research with chronic care populations: Some commentary. Alzheimer Disease and Associated Disorders, 8, S299-S301.


B.     Epidemiological


Breslow, N.E. & Day, N.E. (1980). Statistical Methods in Cancer Research: The Analysis of Case-Control Studies, Volume 1. International Agency for Research on Cancer Scientific Publications.


Magaziner, J., Simonsick, E.M., Kashner, T.M. & Hebel, J.R. (1988). Patient-proxy response comparability on measures of patient health and functional status. Journal of Clinical Epidemiology, 41, 1065-1074.


Teresi, J. & Holmes, D. (1997). Reporting source bias in estimating prevalence of cognitive impairment. Journal of Clinical Epidemiology, 50, 175-184.

C.     Biostatistical


Dwyer, J.H., Feinleib, M., Lippert, P. & Hoffmeister, H. (Eds.) (1992). Statistical Models for Longitudinal Studies of Health. New York: Oxford University Press.


Diggle, P.J., Liang, K.Y. & Zeger, S.L. (1996). Analysis of Longitudinal Data. Oxford Science Publication.


The book describes statistical models and methods for the analysis of longitudinal data, with a strong emphasis on applications in the biological and health sciences.


Little, R.J. & Rubin, D.B. (1987). Statistical Analysis with Missing Data. New York: John Wiley & Sons.


The book is intended for the applied statistician, and hence emphasizes examples over the precise statement of regularity conditions or proofs of theorems on missing data. Methods (ML, EM, GEM) for estimating the mean and covariance matrix of a set of variables are also described.


Littell, R.C., Milliken, G.A., Stroup, W.W. & Wolfinger, R.D. SAS System for Mixed Models. SAS Institute Inc.


The book presents mixed model methodology with numerous examples from several application areas, ranging from basic to advanced. Knowledge of analysis of variance and regression analysis is required.


Fleiss, J.L. (1986). The Design and Analysis of Clinical Experiments. New York: John Wiley and Sons.


Fleiss, J. Statistical Methods for Rates and Proportions.

This is an extremely important book for anyone wishing to conduct statistical analysis using epidemiological designs and categorical variables. Fleiss covers a broad range of topics, from sensitivity and specificity to kappa. Exposure to a good introductory course in biostatistics is required to be able to read and understand this book.


Rogosa, D.R., Brandt, D. & Zimowski, M. (1982). A growth curve approach to the measurement of change. Psychological Bulletin, 92, 726-748.


D'Agostino, R.B., Lange, N. & Ryan, L. (1992). Symposium on longitudinal data analysis: Overview. Statistics in Medicine, 11, 1801-1805.


Breslow, N.E. & Clayton, D.G. (1993). Approximate inference in generalized linear mixed models. Journal of the American Statistical Association, 88, 9-25.


The paper presents a unifying approach tying together the underlying principles of the generalized linear mixed model.


Diggle, P.J. (1988). An approach to the analysis of repeated measurements. Biometrics, 44, 959-971.


Lange, N., Carlin, B.P. & Gelfand, A.E. (1992). Hierarchical Bayes models for the progression of HIV infection using longitudinal CD4 T-cell numbers. Journal of the American Statistical Association, 87, 615-632.


Liang, K.Y. & Zeger, S.L. (1986). Longitudinal data analysis using generalized linear models. Biometrika, 73, 13-22.


The paper presents the generalized estimating equation (GEE) method for regression analysis of repeated measurements (specifically, binary and Poisson response variables).


Little, R.J. & Rubin, D.B. (1983). On jointly estimating parameters and missing data by maximizing the complete-data likelihood. Am. Statist., 37, 218-220.


Patel, H.I. (1991). Analysis of incomplete data from a clinical trial with repeated measurements. Biometrika, 78, 609-619.


Wei, L.J. & Stram, D.O. (1988). Analyzing repeated measurements with possibly missing observations by modeling marginal distributions. Statistics in Medicine, 7, 139-148.


Wolfinger, R.D. (1996). Heterogeneous variance covariance structures for repeated measures. Journal of Agricultural, Biological, and Environmental Statistics.


The primary motivation for modeling heterogeneous variances is the ability to appropriately downweight portions of the data that are highly variable and extract more information from portions of the data that are more precise. Failure to account for heterogeneity when it is present can lead to inefficient and possibly misleading inferences about fixed effects in the model.


Wu, C.F. (1983). On the convergence properties of the EM algorithm. Ann. Statist., 11, 95-103.


Zeger, S.L., Liang, K.Y. & Albert, P.S. (1988). Models for longitudinal data: A generalized estimating equation approach. Biometrics, 44, 1049-1060.


Longitudinal data sets consist of repeated observations of an outcome and a set of covariates for each of many subjects. The paper proposes a unifying approach to such analysis for a variety of discrete and continuous outcomes.


Zeger, S.L. & Liang, K.Y. (1992). An overview of methods for the analysis of longitudinal data. Statistics in Medicine, 11, 1825-1839.


The paper reviews statistical methods for the analysis of discrete and continuous longitudinal data. The relative merits of longitudinal and cross-sectional studies are discussed. Three approaches, marginal, transition and random effects models, are presented with emphasis on the distinct interpretations of their coefficients in the discrete data case.