In the second of our blogs reflecting back and looking ahead, we welcome Associate Professor Paul Norman, University of Leeds, reflecting on changing data sources for contemporary research.
Population geographers started the ‘twenty-teens’ looking forward to the then census rounds, though internationally there had been more than a few rumblings that the era of censuses was coming to an end. In the UK we had had a successful ‘save our small area statistics’ campaign prior to the 2011 Census. Despite the demonstrable research utility of census data, we knew then that ‘cost-benefit’ was close to impossible to establish and that the census’s future was unsure. Since that future was being pushed towards increased use of survey and, more particularly, administrative data, work by both academics and the national statistics agencies (ONS, NRS, NISRA) has moved towards establishing whether population counts and attributes can be estimated using alternative sources. There wasn’t the same bun fight about whether there would be a 2021 Census, but given the indications that this (sadly) really will be the last, population geography will need to rely on data not collected using a census questionnaire. Post-2021 Census work is likely first to examine the impact of largely online data collection, paralleled with comparisons of census variable counts against whichever alternative data exist that may indicate a similar, if slightly differently defined, ‘thing’.
I do buy into the Tukey-ism that it is fine to have approximate answers to good questions. We know enough about the pros and cons of administrative and large-scale survey data that a data future built on these sources is not too scary. However, we know far less about the implications of relying on so-called Big Data. A large number of observations does not necessarily mean information with research utility. Some Big Data sources exist as if by accident (e.g. people Tweeting whilst travelling, thereby indicating commuting patterns) and some via a planned data collection strategy (which for me makes them an administrative or survey source). Whichever definition you buy into, the big ‘but’ in the likely absence of a 2031 Census is which source you compare your Big Data-derived variable with to understand the implications of bias, representativeness, and so on.
The twenty-teens have been yet another decade in which the changing decisions of data disseminators frustrate. For a while, one can readily access a source with the socio-demographic and geographic detail that enables the research process. Then, with no warning, those data detail goalposts are moved and the research process is constrained. Underpinning data dissemination decisions, quite rightly, are considerations of respondent confidentiality. The research community has a vested interest in respecting confidentiality, yet, despite a track record of no breaches or abuses by researchers, the default position seems to be distrust. A key issue for population geographers in the ‘twenty-twenties’ is that unless ‘approved researcher’ status and ‘end user licence’ arrangements make sources such as census and vital statistics, plus administrative and survey data, readily available with sufficiently resolved detail, research will be severely hampered.
For a more detailed consideration of the pros and cons of sources, see:
Norman P, Marshall A & Lomax N (2017) Data analytics: on the cusp of using new sources? Radical Statistics 116: 19-30 http://www.radstats.org.uk/no116/
For an imaginative use of data in a ‘Big Data’ era, see:
Clark S, Morris M & Lomax N (2018) Estimating the outcome of the UK’s referendum on EU membership using e-petition data and machine learning algorithms. Journal of Information Technology & Politics 15(4): 344-357. DOI: 10.1080/19331681.2018.1491926