Data Analysis After Record Linkage
May 12 – 13:00-16:00
Room: Station Master Room, First Floor
A common and detailed approach to data integration, record linkage is essential for matching data on the same entity spread across multiple files. However, record linkage is not necessarily error-free: data belonging to different entities may be linked incorrectly, or links between data on the same entity may be missed. As a consequence, the quality of the resulting data can be significantly reduced. It is therefore advisable to adjust downstream statistical analyses for potential bias caused by data contamination from incorrect links or by sample selection from missed links. However, the information needed for such adjustment may be limited or absent, for example because of privacy considerations. This challenge is especially common in the secondary analysis setting, which is becoming increasingly important because data users may be unable or unwilling to perform record linkage themselves. This beginner-level course will equip attendees to (1) recognize possible sources and consequences of linkage errors, (2) identify methods to account for linkage errors in the secondary analysis setting, (3) use R software to conduct such data analysis in practice, and (4) discuss open problems based on the existing methodologies and their software implementation. Attendees will have the option to run the presented R code in real time. Set-up instructions will be made available beforehand, and we will walk through code and output from example case studies step by step during the session.
Instructors: Brady T. West; Priyanjali Bukke.
Brady T. West is a Research Professor in the Survey Methodology Program, located within the Survey Research Center at the Institute for Social Research on the University of Michigan-Ann Arbor (U-M) campus. He earned his PhD from the Michigan Program in Survey and Data Science in 2011. Before that, he received an MA in Applied Statistics from the U-M Statistics Department in 2002, being recognized as an Outstanding First-year Applied Masters student, and a BS in Statistics with Highest Honors and Highest Distinction from the U-M Statistics Department in 2001. His current research interests include the implications of measurement error in auxiliary variables and survey paradata for survey estimation, selection bias in surveys, responsive/adaptive survey design, interviewer effects, and multilevel regression models for clustered and longitudinal data. He is the lead author of a book comparing different statistical software packages in terms of their mixed-effects modeling procedures (Linear Mixed Models: A Practical Guide using Statistical Software, Third Edition, Chapman Hall/CRC Press, 2022), and he is also the lead author of a second book entitled Applied Survey Data Analysis (with Steven Heeringa and Pat Berglund), the third edition of which will be available in early 2025. He was elected as a Fellow of the American Statistical Association in 2022.
Priyanjali Bukke is a PhD student in Statistics at the University of Virginia. Her research interests include data integration and its relation to data privacy and quality. Supported by the NSF, Priyanjali is involved in a collaborative project under the supervision of Martin Slawski and Brady West to develop a new framework for analyzing data resulting from imperfectly merged files. She also maintains open-source R software that implements this framework.