From Polls to Patterns: How AI-Generated Data May Outpace Traditional Surveys
Jana Peters (University of Applied Sciences Düsseldorf) - Germany
Hella de Haas (University of Applied Sciences Düsseldorf) - Germany
Olaf Jandura (University of Applied Sciences Düsseldorf) - Germany
Keywords: Generative AI, Synthetic Data, Election Polls, LLM, ChatGPT
Abstract
The landscape of survey research is shifting with the emergence of tools that leverage synthetic data and challenge traditional survey methods. Some services now use AI-generated datasets to offer scalable, cost-efficient alternatives to established survey institutions. These developments raise questions about their implications for the field and about how well synthetic respondents can replicate human ones (Argyle et al. 2022; Bisbee et al. 2024; Horton 2023). Large language models enable the creation of synthetic datasets that address challenges such as declining response rates (Stedman et al. 2019), missing data, and careless responding (Johnson 2023). However, the fidelity of these datasets remains under scrutiny. This study evaluates synthetic data by comparing an AI-generated dataset to a gold-standard survey dataset.
The benchmark was a July 2022 survey of 3,448 respondents, drawn from online-active individuals in Germany aged 16–69. Using ChatGPT, we generated a synthetic dataset for the same population and time frame. The dataset was built iteratively, adding demographic, socio-economic, and political variables step by step. Variables such as gender, age, education, and income were modeled on national statistics, incorporating dependencies such as the relationship between income and education as well as regional patterns. Political preferences were represented as binary indicators of party consideration, with the relationships among these indicators modeled as well.
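The abstract does not report the generation pipeline itself; the following is a minimal sketch of how such an iterative, prompt-based generation might look with the OpenAI Python client. The prompt wording, model name, batch size, and record schema are illustrative assumptions, not the authors' actual setup.

```python
# Illustrative sketch only: prompt, model name, and schema are assumptions,
# not the authors' actual generation pipeline.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = (
    "Generate {n} synthetic survey respondents as a JSON list. "
    "Population: online-active individuals in Germany, aged 16-69, July 2022. "
    "Fields: gender, age, education, income_eur, region. "
    "Match official German population statistics and keep plausible "
    "relationships, e.g. between income and education and across regions."
)

def generate_batch(n: int = 50) -> list[dict]:
    """Request one batch of synthetic respondents from the model."""
    response = client.chat.completions.create(
        model="gpt-4o",  # hypothetical choice; the study only names ChatGPT
        messages=[{"role": "user", "content": PROMPT.format(n=n)}],
    )
    return json.loads(response.choices[0].message.content)

# Build the dataset batch by batch until it matches the benchmark size;
# further variables (e.g. party consideration) would be added in later passes.
respondents: list[dict] = []
while len(respondents) < 3448:
    respondents.extend(generate_batch())
```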
The comparison shows both similarities and differences between the datasets. Gender distributions were nearly identical, while age distributions differed: the synthetic dataset had a lower mean age (M = 39.61, SD = 15.11) than the original (M = 45.97, SD = 14.23), likely because the original survey only reached respondents aged 18–69, although the intended range was 16–69 for both datasets. This highlights the potential of synthetic datasets to better realize target demographics, including harder-to-reach groups. Education and income also varied: the synthetic data contained more respondents with lower secondary degrees (21.9% vs. 10.6%), which contributed to a larger share in lower-income categories (e.g., under 1,000 euros: 17.5% vs. 8.8%). The synthetic data had no missing income responses, whereas 9.1% of original respondents left this question unanswered.
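As an illustration of this kind of distributional comparison, the sketch below contrasts the two datasets on age and education. The file names, column names, and choice of tests (Welch's t-test, chi-square) are assumptions for illustration, not the authors' reported procedure.

```python
# Hypothetical comparison of original vs. synthetic datasets;
# file and column names are assumptions for illustration.
import pandas as pd
from scipy import stats

original = pd.read_csv("original_survey.csv")
synthetic = pd.read_csv("synthetic_survey.csv")

# Mean age difference (Welch's t-test, unequal variances).
t, p = stats.ttest_ind(original["age"], synthetic["age"], equal_var=False)
print(f"age: t = {t:.2f}, p = {p:.4f}")

# Education shares across datasets (e.g. lower secondary: 21.9% vs. 10.6%).
combined = pd.concat([
    original.assign(source="original"),
    synthetic.assign(source="synthetic"),
])
counts = pd.crosstab(combined["source"], combined["education"])
chi2, p, dof, _ = stats.chi2_contingency(counts)
print(f"education: chi2 = {chi2:.2f}, df = {dof}, p = {p:.4f}")
```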
Political preferences followed some of the original patterns but diverged on others. The correlation between the Christian Democratic Union and the Alternative for Germany, slightly negative in the original data (r = -.055, p < .01), was positive in the synthetic data (r = .081, p < .001), likely because the model grouped conservative parties together. The correlation between the Greens and the Pirate Party was higher in the synthetic dataset (r = .106, p < .001 vs. r = .076, p < .001), while the correlation between the Greens and the Social Democratic Party, strong in the original data (r = .411, p < .001), was markedly weaker (r = .137, p < .001).
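Because party consideration is coded as binary indicators, Pearson's r on these columns is the phi coefficient. A minimal sketch of reproducing such pairwise correlations follows; the file names and party column names are assumptions.

```python
# Pairwise correlations between binary party-consideration indicators;
# with 0/1 coding, Pearson's r equals the phi coefficient.
# File and column names are assumptions for illustration.
import pandas as pd
from scipy import stats

original = pd.read_csv("original_survey.csv")
synthetic = pd.read_csv("synthetic_survey.csv")

PARTIES = ["cdu", "afd", "greens", "spd", "pirates"]  # hypothetical columns

def party_correlations(df: pd.DataFrame, label: str) -> None:
    """Print r and p for every pair of party indicators."""
    for i, a in enumerate(PARTIES):
        for b in PARTIES[i + 1:]:
            r, p = stats.pearsonr(df[a], df[b])
            print(f"{label}: {a} x {b}: r = {r:.3f}, p = {p:.4f}")

party_correlations(original, "original")    # e.g. cdu x afd: r = -.055
party_correlations(synthetic, "synthetic")  # vs. r = .081 in synthetic data
```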
These differences suggest that the synthetic data captures broad ideological groupings but flattens more nuanced response behavior. While the synthetic data improved demographic consistency and reduced some biases, it also diminished real-world variability, such as ambivalent preferences. As in earlier debates on merged data and statistical twins, generative AI raises questions about the quality and realism of synthetic datasets. This study will use cluster analyses to identify subgroups and to assess whether the synthetic data reflects the diversity and structure of the original population.
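The planned cluster analysis is not specified in the abstract; one plausible sketch, assuming k-means on standardized demographic and party variables, is shown below. The feature list, number of clusters, and file names are assumptions, and comparing per-cluster profiles across the two datasets is one possible way to assess whether the original subgroup structure reappears.

```python
# Illustrative cluster-analysis sketch; the abstract does not name a method,
# so k-means on standardized features is an assumption. File, column, and
# feature names are hypothetical.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

original = pd.read_csv("original_survey.csv")
synthetic = pd.read_csv("synthetic_survey.csv")

FEATURES = ["age", "education", "income_eur", "cdu", "afd", "greens", "spd"]

def cluster_profile(df: pd.DataFrame, k: int = 4) -> pd.DataFrame:
    """Cluster respondents and return per-cluster feature means."""
    X = StandardScaler().fit_transform(df[FEATURES])
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    return df[FEATURES].groupby(labels).mean()

# Compare whether subgroups found in the original survey reappear in the
# synthetic data with similar profiles and relative sizes.
print(cluster_profile(original))
print(cluster_profile(synthetic))
```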