Survey Research in Times of Big Data

  1. Cabrera-Álvarez, Pablo
Journal:
Empiria: Revista de metodología de ciencias sociales

ISSN: 1139-5737

Year of publication: 2022

Issue title: El Big data en las ciencias sociales (Big Data in the Social Sciences)

Issue: 53

Pages: 31-51

Type: Article

DOI: 10.5944/EMPIRIA.53.2022.32611


Abstract

Surveys are the predominant research technique in the social sciences. However, the emergence of other data sources, such as social media posts or GPS-generated data, opens up new opportunities for research. In this scenario, some voices have argued that big data, being cheaper and faster to generate than survey data, will progressively replace surveys and will be enough to answer relevant research questions. This optimism contrasts with the quality and accessibility problems that big data present, such as the lack of coverage of some population groups, data ownership, and restricted access to some of these sources. Drawing on an in-depth review of the literature of recent years, this article explores how cooperation between big data and surveys results in significant improvements in data quality and a reduction in survey costs.

Funding information

The project that generated these results was supported by a fellowship from the Fundación Bancaria “la Caixa” (ID 100010434), under code LCF/BQ/ES16/11570005.

References

  • AL BAGHAL, T., SLOAN, L., JESSOP, C., WILLIAMS, M. L., BURNAP, P. (2019): “Linking Twitter and Survey Data: The Impact of Survey Mode and Demographics on Consent Rates Across Three UK Studies”, Social Science Computer Review.
  • ANSOLABEHERE, S., HERSH, E. (2012): “Validation: What big data reveal about survey misreporting and the real electorate”, Political Analysis, 20, 4, 437–459.
  • BÄHR, S., HAAS, G.-C., KEUSCH, F., KREUTER, F., TRAPPMANN, M. (2020): “Missing Data and Other Measurement Quality Issues in Mobile Geolocation Sensor Data”, Social Science Computer Review.
  • BAKER, R. (2017): Big Data. In: Total Survey Error in Practice. John Wiley & Sons, Inc., Hoboken, NJ, USA, 47–69.
  • BAKER, R., BRICK, J. M., BATES, N. A., BATTAGLIA, M., COUPER, M. P., DEVER, J. A., GILE, K. J., TOURANGEAU, R. (2013): “Summary report of the aapor task force on non-probability sampling”, Journal of Survey Statistics and Methodology, 1, 2, 90–105.
  • BIDDLE, N., BREUNIG, R., MARKHAM, F., WOKKER, C. (2019): “Introducing the Longitudinal Multi-Agency Data Integration Project and Its Role in Understanding Income Dynamics in Australia”, Australian Economic Review, 52, 4, 476–495.
  • BIEMER, P. P., PEYTCHEV, A. (2012): “Census geocoding for nonresponse bias evaluation in telephone surveys”, Public Opinion Quarterly, 76, 3, 432–452.
  • BOASE, J., LING, R. (2013): “Measuring Mobile Phone Use: Self-Report Versus Log Data”, Journal of Computer-Mediated Communication, 18, 4, 508–519.
  • BUELENS, B., BURGER, J., VAN DEN BRAKEL, J. A. (2018): “Comparing Inference Methods for Non-probability Samples”, International Statistical Review, 86, 2, 322–343.
  • BUSKIRK, T. D. (2018): “Surveying the Forests and Sampling the Trees: An overview of Classification and Regression Trees and Random Forests with applications in Survey Research”, Survey Practice, 11, 1, 1–13.
  • CALDERWOOD, L., LESSOF, C. (2009): Enhancing Longitudinal Surveys by Linking to Administrative Data. In: Lynn, P. (ed.): Methodology of Longitudinal Surveys. John Wiley & Sons, Ltd, Chichester, UK, 55–72.
  • CALLEGARO, M., YANG, Y. (2018): The Role of Surveys in the Era of “Big Data.” In: The Palgrave Handbook of Survey Research. Springer International Publishing, Cham, 175–192.
  • CARPENTER, J., KENWARD, M. (2012): Multiple Imputation and its Application.
  • CHEN, J. K., VALLIANT, R. L., ELLIOTT, M. R. (2018): “Model-assisted calibration of non-probability sample survey data using adaptive LASSO”, Survey Methodology, 44, 1, 117–145.
  • CHEW, R. F., AMER, S., JONES, K., UNANGST, J., CAJKA, J., ALLPRESS, J., BRUHN, M. (2018): “Residential scene classification for gridded population sampling in developing countries using deep convolutional neural networks on satellite imagery”, International Journal of Health Geographics, 17, 1, 1–17.
  • CONNELLY, R., PLAYFORD, C. J., GAYLE, V., DIBBEN, C. (2016): “The role of administrative data in the big data revolution in social science research”, Social Science Research, 59, 1–12.
  • COOPER, H., HEDGES, L. V., VALENTINE, J. C. (2019): The Handbook of Research Synthesis and Meta-Analysis. Russell Sage Foundation.
  • CORNWELL, E. Y., CAGNEY, K. A. (2017): “Aging in activity space: Results from smartphone-based GPS-tracking of urban seniors”, Journals of Gerontology Series B Psychological Sciences and Social Sciences, 72, 5, 864–875.
  • COUPER, M. P. (2013): “Is the sky falling? New technology, changing media, and the future of surveys”, Survey Research Methods, 7, 3, 145–156.
  • DE LEEUW, E. D., HOX, J. J., LUITEN, A. (2018): “International Nonresponse Trends across Countries and Years: An analysis of 36 years of Labour Force Survey data”, Survey Methods: Insights from the Field, 1–11.
  • DISSING, A. S., ROD, N. H., GERDS, T. A., LUND, R. (2021): “Smartphone interactions and mental well-being in young adults: A longitudinal study based on objective high-resolution smartphone data”, Scandinavian Journal of Public Health, 49, 3, 325–332.
  • DOMO (2019): Data never sleeps, https://www.domo.com/learn/data-never-sleeps-6.
  • DURRANT, G. B., MASLOVSKAYA, O., SMITH, P. W. F. (2017): “Using prior wave information and paradata: Can they help to predict response outcomes and call sequence length in a longitudinal study?”, Journal of Official Statistics, 33, 3, 801–833.
  • EADY, G., NAGLER, J., GUESS, A., ZILINSKY, J., TUCKER, J. A. (2019): “How Many People Live in Political Bubbles on Social Media? Evidence From Linked Survey and Twitter Data”, SAGE Open, 9, 1.
  • ELLIOTT, M. R., VALLIANT, R. (2017): “Inference for Nonprobability Samples”, Statistical Science, 32, 2, 249–264.
  • ENAMORADO, T., IMAI, K. (2019): “Validating Self-Reported Turnout by Linking Public Opinion Surveys with Administrative Records”, Public Opinion Quarterly, 83, 4, 723–748.
  • EUROPEAN COMMISSION (2019): City data from LFS and Big Data.
  • EUROSTAT (2016): Internet use by individuals. https://ec.europa.eu/eurostat/documents/2995521/7771139/9-20122016-BP-EN.pdf/f023d81a-dce2-4959-93e3-8cc7082b6edd
  • FAY, R. E., HERRIOT, R. A. (1979): “Estimates of Income for Small Places: An Application of James-Stein Procedures to Census Data”, Journal of the American Statistical Association, 74, 366a, 269–277.
  • FERBER, R., FORSYTHE, J., GUTHRIE, H. W., MAYNES, E. S. (1969): “Validation of a National Survey of Consumer Financial Characteristics: Savings Accounts”, The Review of Economics and Statistics, 436–444.
  • FERRI-GARCÍA, R., DEL MAR RUEDA, M. (2020): “Propensity score adjustment using machine learning classification algorithms to control selection bias in online surveys”, PLoS ONE, 15, 4, 1–19.
  • FORSYTH, J., BOUCHER, L. (2015): “Why Big Data Is Not Enough”, Research World, 50, 2015, 26–27.
  • GAYO-AVELLO, D. (2012): “‘I Wanted to Predict Elections with Twitter and all I got was this Lousy Paper’ - A Balanced Survey on Election Prediction using Twitter Data”, CoRR.
  • GELMAN, A. (2007): “Struggles with survey weighting and regression modeling”, Statistical Science, 22, 2, 153–164.
  • GERSCHENFELD, N., KRIKORIAN, R., COHEN, D. (2004): “The Sevenfold Way”, Scientific American, 291, 4, 76–81.
  • GROVES, R. M., FOWLER, F. J., JR., COUPER, M. P., LEPKOWSKI, J. M., SINGER, E., TOURANGEAU, R. (2013): Survey Methodology, John Wiley & Sons.
  • GROVES, R. M., HEERINGA, S. G. (2006): “Responsive design for household surveys: tools for actively controlling survey errors and costs”, Journal of the Royal Statistical Society: Series A (Statistics in Society), 169, 3, 439–457.
  • GWEON, H., SCHONLAU, M., KACZMIREK, L., BLOHM, M., STEINER, S. (2017): “Three methods for occupation coding based on statistical learning”, Journal of Official Statistics, 33, 1, 101–122.
  • HAENSCHEN, K. (2018): “Self-Reported Versus Digitally Recorded: Measuring Political Activity on Facebook”, Social Science Computer Review.
  • HAND, D. J. (2018): “Statistical challenges of administrative and transaction data”, Journal of the Royal Statistical Society. Series A: Statistics in Society, 181, 3, 555–605.
  • HE, Z., SCHONLAU, M. (2019): “Automatic Coding of Text Answers to Open-Ended Questions: Should You Double Code the Training Data?”, Social Science Computer Review, 1–12.
  • HENDERSON, M., JIANG, K., JOHNSON, M., PORTER, L. (2019): “Measuring Twitter Use: Validating Survey-Based Measures”, Social Science Computer Review, 1–21.
  • HERSH, E. D. (2015): Hacking the electorate: How campaigns perceive voters, Cambridge University Press.
  • HILL, C. A., BIEMER, P. P., BUSKIRK, T. D., CALLEGARO, M., CORDOVA CAZAR, A. L., ECK, A., JAPEC, L., KIRCHNER, A., KOLENIKOV, S., LYBERG, L. E., STURGIS, P. (2019): “Exploring new statistical frontiers at the intersection of survey science and Big Data: Convergence at ‘Bigsurv18.’”, Survey Research Methods, 13, 1.
  • HSIEH, Y. P., MURPHY, J. (2017): “Total Twitter Error”, in Total Survey Error in Practice, Wiley & Sons, 23–46.
  • JÄCKLE, A., BENINGER, K., BURTON, J., COUPER, M. P. (2018): “Understanding data linkage consent in longitudinal surveys”, Understanding Society Working Paper Series, University of Essex.
  • JÄCKLE, A., GAIA, A., LESSOF, C., COUPER, M. P. (2019): “A review of new technologies and data sources for measuring household finances: Implications for total survey error”, Understanding Society Working Paper Series, University of Essex.
  • JAPEC, L., KREUTER, F., BERG, M., BIEMER, P., DECKER, P., LAMPE, C., LANE, J., O’NEIL, C., USHER, A. (2015): “AAPOR Report on Big Data”, American Association for Public Opinion Research.
  • JÜRGENS, P., STARK, B., MAGIN, M. (2019): “Two Half-Truths Make a Whole? On Bias in Self-Reports and Tracking Data”, Social Science Computer Review, 1–16.
  • KALTON, G. (2019): “Developments in Survey Research over the Past 60 Years: A Personal Perspective”, International Statistical Review, 87, S1, S10–S30.
  • KERN, C., KLAUSCH, T., KREUTER, F. (2019): “Tree-based machine learning methods for survey research”, Survey Research Methods, 13, 1, 73–93.
  • KEUSCH, F., BÄHR, S., HAAS, G. C., KREUTER, F., TRAPPMANN, M. (2020): “Coverage Error in Data Collection Combining Mobile Surveys With Passive Measurement Using Apps: Data From a German National Survey”, Sociological Methods and Research.
  • KIM, J., TAM, S.-M. (2020): “Data Integration by combining big data and survey sample data for finite population inference”, International Statistical Review, 1–30.
  • KIRGIS, N. G., LEPKOWSKI, J. M. (2013): Design and Management Strategies for Paradata-Driven Responsive Design: Illustrations from the 2006-2010 National Survey of Family Growth. In: Improving Surveys with Paradata. John Wiley & Sons, Inc., Hoboken, New Jersey, 121–144.
  • KLINGWORT, J., BUELENS, B., SCHNELL, R. (2019): “Capture–Recapture Techniques for Transport Survey Estimate Adjustment Using Permanently Installed Highway-Sensors”, Social Science Computer Review.
  • KREUTER, F. (2013): Improving Surveys with Paradata: Analytic Uses of Process Information. John Wiley & Sons.
  • KÜNN, S. (2015): “The challenges of linking survey and administrative data”, IZA World of Labor.
  • LANEY, D. (2001): “META Delta”, Application Delivery Strategies.
  • LAZER, D., BREWER, D., CHRISTAKIS, N., FOWLER, J., KING, G. (2009): “Life in the network: the coming age of computational social science”, Science, 323, 5915, 721–723.
  • LOHR, S. L., RAGHUNATHAN, T. E. (2017): “Combining Survey Data with Other Data Sources”, Statistical Science, 32, 2, 293–312.
  • MCMINN, M. A., MARTIKAINEN, P., GORMAN, E., RISSANEN, H., HÄRKÄNEN, T., TOLONEN, H., LEYLAND, A. H., GRAY, L. (2019): “Validation of non-participation bias methodology based on record-linked Finnish register-based health survey data: A protocol paper”, BMJ Open, 9, 4, 1–6.
  • MERCER, A. W. (2018): Selection Bias in Nonprobability surveys: a causal inference approach, Doctoral dissertation, University of Maryland, College Park.
  • MEYER, B. D., MITTAG, N. (2019): “Using linked survey and administrative data to better measure income: Implications for poverty, program effectiveness, and holes in the safety net”, American Economic Journal: Applied Economics, 11, 2, 176–204.
  • MILLER, P. V. (2017): “Is There a Future for Surveys?”, Public Opinion Quarterly, 81, 205–212.
  • MÖLLER, J., VAN DE VELDE, R. N., MERTEN, L., PUSCHMANN, C. (2019): “Explaining Online News Engagement Based on Browsing Behavior: Creatures of Habit?”, Social Science Computer Review.
  • MORIARITY, C., SCHEUREN, F. (2001): “Statistical Matching: A Paradigm for Assessing the Uncertainty in the Procedure”, Journal of Official Statistics, 17, 3, 407.
  • MURPHY, J., HILL, C. A., DEAN, E. (2013): Social Media, Sociality, and Survey Research. In: Social Media, Sociality, and Survey Research. John Wiley & Sons, Inc., Hoboken, NJ, USA, 1–33.
  • NEGROPONTE, N., HARRINGTON, R., MCKAY, S. R., CHRISTIAN, W. (1997): “Being digital”, Computers in Physics, 11, 3, 261–262.
  • NEYMAN, J. (1934): “On the Two Different Aspects of the Representative Method: The Method of Stratified Sampling and the Method of Purposive Selection”, Journal of the Royal Statistical Society, 97, 4, 558.
  • OLSON, K., WAGNER, J. (2015): “A feasibility test of using smartphones to collect GPS information in face-to-face surveys”, Survey Research Methods, 9, 1, 1–13.
  • PARRY, H. J., CROSSLEY, H. M. (1950): “Validity of responses to survey questions”, Public Opinion Quarterly, 14, 1, 61–80.
  • PASEK, J., JANG, S. M., COBB, C. L., DENNIS, J. M., DISOGRA, C. (2014): “Can marketing data aid survey research? Examining accuracy and completeness in consumer-file data”, Public Opinion Quarterly, 78, 4, 889–916.
  • PEYTCHEV, A., RAGHUNATHAN, T. (2013): “Evaluation and Use of Commercial Data for Nonresponse Bias Adjustment”, American Association for Public Opinion Research annual conference.
  • PIETSCH, A.-S., LESSMANN, S. (2018): “Topic modeling for analyzing open-ended survey responses”, Journal of Business Analytics, 2, 1, 93–116.
  • PLAYFORD, C. J., GAYLE, V., CONNELLY, R., GRAY, A. J. J. G. (2016): “Administrative social science data: The challenge of reproducible research”, Big Data and Society, 3, 2, 1–13.
  • POLIDORO, F., GIANNINI, R., LO CONTE, R., MOSCA, S., ROSSETTI, F. (2015): “Web scraping techniques to collect data on consumer electronics and airfares for Italian HICP compilation”, Statistical Journal of the IAOS, 31, 2, 165–176.
  • RAENTO, M., OULASVIRTA, A., EAGLE, N. (2009): “Smartphones: An emerging tool for social scientists”, Sociological Methods and Research, 37, 3, 426–454.
  • RAFEI, A., FLANNAGAN, C. A. C., ELLIOTT, M. R. (2020): “Big data for finite population inference: Applying quasi-random approaches to naturalistic driving data using bayesian additive regression trees”, Journal of Survey Statistics and Methodology, 8, 1, 148–180.
  • RAO, J. N. K., MOLINA, I. (2015): Small Area Estimation: Second Edition. John Wiley & Sons, Inc, Hoboken, NJ, USA.
  • REVILLA, M., COUPER, M. P., OCHOA, C. (2019): “Willingness of online panelists to perform additional tasks”, Methods, Data, Analyses, 13, 2, 223–251.
  • ROSSMANN, J., GUMMER, T. (2015): “Using Paradata to Predict and Correct for Panel Attrition”, Social Science Computer Review, 34, 3, 312–332.
  • SAKSHAUG, J. W., COUPER, M. P., OFSTEDAL, M. B., WEIR, D. R. (2012): “Linking Survey and Administrative Records”, Sociological Methods & Research, 41, 4, 535–569.
  • SAKSHAUG, J. W., ECKMAN, S. (2017): “Are survey nonrespondents willing to provide consent to use administrative records? Evidence from a nonresponse follow-up survey in Germany”, Public Opinion Quarterly, 81, 2, 495–522.
  • SALA, E., BURTON, J., KNIES, G. (2013): “Correlates of Obtaining Informed Consent to Data Linkage: Respondent, Interview, and Interviewer Characteristics”, Sociological Methods & Research, 41, 3, 414–439.
  • SALGANIK, M. J. (2017): Bit by Bit: Social Research in the Digital Age. Princeton University Press.
  • SAVAGE, M., BURROWS, R. (2007): “The Coming Crisis of Empirical Sociology”, Sociology, 41, 5, 885–899.
  • SCHARKOW, M. (2016): “The Accuracy of Self-Reported Internet Use—A Validation Study Using Client Log Data”, Communication Methods and Measures, 10, 1, 13–27.
  • SCHOBER, M. F., PASEK, J., GUGGENHEIM, L., LAMPE, C., CONRAD, F. G. (2016): “Social Media Analyses for Social Measurement”, Public Opinion Quarterly, 80, 1, 180–211.
  • SCHONLAU, M., COUPER, M. P. (2016): “Semi-automated categorization of open-ended questions”, Survey Research Methods, 10, 2, 143–152.
  • SCOTT, P. R., JACKA, M. (2012): Auditing Social Media. John Wiley & Sons, Inc., Hoboken, NJ, USA.
  • SELB, P., MUNZERT, S. (2013): “Voter overrepresentation, vote misreporting, and turnout bias in postelection surveys”, Electoral Studies, 32, 1, 186–196.
  • SHARMA, S. N. (2019): “Paradata, Interviewing Quality, and Interviewer Effects”, Doctoral Dissertation.
  • SLOAN, L. (2017): “Who Tweets in the United Kingdom? Profiling the Twitter Population Using the British Social Attitudes Survey 2015”, Social Media + Society, 3, 1.
  • STEVENS, F. R., GAUGHAN, A. E., LINARD, C., TATEM, A. J. (2015): “Disaggregating census data for population mapping using Random forests with remotely-sensed and ancillary data”, PLoS ONE, 10, 2, 1–22.
  • STIER, S., BREUER, J., SIEGERS, P., THORSON, K. (2019): “Integrating Survey Data and Digital Trace Data: Key Issues in Developing an Emerging Field”, Social Science Computer Review.
  • THOMSON, D. R., STEVENS, F. R., RUKTANONCHAI, N. W., TATEM, A. J., CASTRO, M. C. (2017): “GridSample: An R package to generate household survey primary sampling units (PSUs) from gridded population data”, International Journal of Health Geographics, 16, 1, 1–19.
  • VALLIANT, R. (2019): “Comparing Alternatives for Estimation from Nonprobability Samples”, Journal of Survey Statistics and Methodology, 1–33.
  • VRAGA, E. K., TULLY, M. (2018): “Who Is Exposed to News? It Depends on How You Measure: Examining Self-Reported Versus Behavioral News Exposure Measures”, Social Science Computer Review.
  • WANG, W., ROTHSCHILD, D., GOEL, S., GELMAN, A. (2015): “Forecasting elections with non-representative polls”, International Journal of Forecasting, 31, 3, 980–991.
  • WARD, J. S., BARKER, A. (2013): “Undefined By Data: A Survey of Big Data Definitions”, arXiv preprint arXiv:1309.5821.
  • WENZ, A., JÄCKLE, A., COUPER, M. P. (2019): “Willingness to use mobile technologies for data collection in a probability household panel”, Survey Research Methods, 13, 1, 1–22.
  • WOOLLARD, M. (2014): Administrative Data: Problems and Benefits: A perspective from the United Kingdom. SCIVERO, Berlin