Dishani Sen receives her ISPAward from Henri-Jean Pollet, president of ISPA Belgium. 

Using Digital Trace Data to Generate Representative Estimates on Disease Prevalence (COVID-19 Infections) in Belgian Municipalities

Is it possible to predict the area-level prevalence of COVID-19 infections in Belgium by analyzing
self-reported symptoms on Twitter? Relying solely on hospital and clinic-focused studies has its limitations,
so researchers have been exploring the potential of digital trace data to gain a better understanding of
the prevalence of COVID-19 and the symptoms experienced by infected individuals.

Monitoring social media data is a promising strategy for public health surveillance, and public health
organizations are investing in this capability in order to receive real-time signals of pandemic upticks
and spread. However, social media data is often unstructured and is not a representative sample of the
population, because usage frequencies and access rates are skewed across demographic groups. As such, any
direct estimate from a platform like Twitter is likely biased toward certain demographics. With this in
mind, this study attempts to use tweets (digital trace data) to make inferences about the fine-grained
prevalence of COVID-19 infections in Belgium.

This research generates estimates of the incidence of COVID-19 infections at the municipality level
by using Multilevel Regression and Post-stratification (MrP) to account for sampling biases in the social
media sample. First, tweets are obtained from users based on keywords derived from previous research, e.g.,
tweets mentioning fever, cough, loss of taste, fatigue, etc. Then, key demographic and geographical
features of interest are extracted using the M3 deep learning pipeline together with simple self-reported
characteristics, effectively transforming the unstructured Twitter sample into a survey-like object. Finally,
based on these demographic features and census characteristics, a mixed effects logistic regression model
with post-stratification according to the Belgian census is proposed to forecast the number of infected
individuals on a particular day. This study contributes a proof of concept (POC) of a complete end-to-end
pipeline for real-time predictions of disease prevalence at a granular level in a population using
social media data. Through this POC, contributions are made to three core elements: collecting
tweets at scale, extracting demographic features and assigning a location value to convert
unstructured digital data into survey-like objects, and using a multilevel regression model with
post-stratification to make real-time predictions on the population using digital trace data.
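
The post-stratification step described above can be illustrated with a minimal sketch. It is not the study's
implementation: the coefficient values, the demographic categories, the municipality labels "A" and "B", and
the census counts are all hypothetical placeholders. The sketch assumes a mixed effects logistic regression
has already been fitted, and shows only how cell-level predictions are reweighted by census counts to yield
one estimate per municipality.

```python
import math


def inv_logit(x: float) -> float:
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def poststratify(cells, intercept, age_effects, sex_effects, muni_effects):
    """Census-weighted prevalence per municipality.

    cells: iterable of (municipality, age_group, sex, census_count).
    The effect dictionaries stand in for coefficients of an already
    fitted multilevel logistic regression (hypothetical values here).
    """
    weighted = {}  # sum of census_count * predicted probability
    totals = {}    # sum of census_count
    for muni, age, sex, count in cells:
        # Linear predictor for this demographic cell; a municipality
        # without a fitted random effect falls back to 0.0.
        eta = (intercept + age_effects[age] + sex_effects[sex]
               + muni_effects.get(muni, 0.0))
        weighted[muni] = weighted.get(muni, 0.0) + count * inv_logit(eta)
        totals[muni] = totals.get(muni, 0) + count
    return {m: weighted[m] / totals[m] for m in weighted}


# Toy inputs, invented purely for illustration.
census_cells = [
    ("A", "18-39", "F", 1000), ("A", "18-39", "M", 1000),
    ("A", "40+", "F", 1500), ("A", "40+", "M", 1500),
    ("B", "18-39", "F", 400), ("B", "18-39", "M", 450),
    ("B", "40+", "F", 900), ("B", "40+", "M", 850),
]
estimates = poststratify(
    census_cells,
    intercept=-4.0,
    age_effects={"18-39": 0.3, "40+": -0.1},
    sex_effects={"F": 0.05, "M": -0.05},
    muni_effects={"A": 0.2, "B": -0.2},
)
for muni, prevalence in sorted(estimates.items()):
    print(f"{muni}: {prevalence:.4f}")
```

Because each cell's prediction is weighted by its census count rather than by its share of the Twitter
sample, demographic groups that are over-represented among Twitter users do not dominate the municipality
estimate.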

The study’s overall hypothesis was that the prevalence of COVID-19 at the municipal level can
be modeled using MrP on features extracted from aggregated tweets to generate representative estimates.
The study’s estimates correlate at 93% with actual data on the prevalence of COVID-19 infections in
Belgium for a reference period. This strong positive correlation is a promising indication that the
Twitter data carries a substantial signal and that this methodology has high potential in digital
epidemiology.
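
As an illustration of how such an agreement figure can be computed, the sketch below calculates a Pearson
correlation coefficient between model estimates and reported values. The source does not state which
correlation measure was used, so Pearson is an assumption, and the numbers are invented toy values rather
than results from the study.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Toy per-municipality prevalence values, invented for illustration only.
estimated = [0.010, 0.020, 0.015, 0.030]
reported = [0.012, 0.018, 0.017, 0.028]
print(f"r = {pearson(estimated, reported):.3f}")
```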
