Geospatial Predictive Modeling Sampling and Covid-19
Geographic information systems (GIS) data is important for analyzing the macro effects the human species has on our environment, both at a local level (for instance, in crop analysis) and at a global level. Recently, GIS data has been integral in helping us understand the nature of pandemics such as the Covid-19 outbreak, making “statistical model” a household term. Modeling pandemics such as Covid-19 presents several challenges to data scientists, since these models require not only an understanding of the pathogen but also an understanding of how to use GIS data within a predictive model. Data scientists who wish to use GIS data within predictive applications should understand both the specific sampling challenges and how iterative data mining methodologies such as CRISP-DM may improve model fitting.
Geospatial predictive modeling combines location data with advanced modeling techniques to analyze and predict patterns found in spatial, geological, and geographical features. GIS data allows data scientists to perform advanced analysis of phenomena which have a geological dimension (e.g., the analysis of flora or fauna in the oceans), a geographical dimension (e.g., an analysis of education level based on the proximity of schools to public transportation), or combined geological and geographical dimensions (e.g., the effect of local government policy on river pollution). GIS predictive modeling presents several unique challenges related to sample design, particularly in epidemiological models such as those predicting the spread of Covid-19. However, the use of data science methodologies (such as CRISP-DM) and agile-based leadership, which encourages continual iteration, could help to alleviate these problems.
Sample design is the methodology by which samples are chosen to build the predictive model. Samples can be chosen by (a) a non-random selection process, (b) an unstratified random selection process, or (c) a stratified random selection process (Li, 2019. 2). Non-random samples are chosen either by expert judgement or opportunistically: for instance, samples gathered based on weather patterns for geological data or, for geographic data, based on individuals who live within a certain distance of a hospital or clinic for which data is available (Li, 2019. 2). Although non-random sampling increases the risk of bias because of these sampling constraints, the iterative nature of CRISP-DM allows additional samples to be taken in later iterations, i.e., “stratified random sampling with prior information” (Li, 2019. 3), to adjust for this bias. In the case of Covid-19, our initial lack of testing data resulted in a sample restricted to the sickest of the sick: only those who were admitted to a hospital or urgent care center (Holmdahl, I. 2020). Further, the disease disproportionately affected older populations. As a result, our early GIS information may have carried an age-related bias (since our data did not accurately account for younger asymptomatic carriers), and the resulting models were highly uncertain and “spatially heterogeneous” due to overfitting on only confirmed cases (Holmdahl, I. 2020).
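The age-related bias described above can be illustrated with a small simulation. This is only a sketch under invented assumptions: the population, the age range, and the rule that the chance of hospital admission rises with age are all made up for illustration and are not taken from Holmdahl (2020) or from any real Covid-19 data.

```python
import random
from statistics import mean


def simulate_population(n, seed=0):
    """Build a synthetic population of infected people (all values are invented)."""
    rng = random.Random(seed)
    people = []
    for _ in range(n):
        age = rng.randint(0, 90)
        # Assumption for illustration only: admission probability rises with age.
        p_hospital = 0.01 + 0.30 * (age / 90) ** 2
        people.append({"age": age, "hospitalized": rng.random() < p_hospital})
    return people


population = simulate_population(50_000)

# Opportunistic (non-random) sample: only cases a hospital would have recorded.
hospital_sample = [p for p in population if p["hospitalized"]]

# Unstratified random sample of the same size, drawn from everyone infected.
random_sample = random.Random(1).sample(population, len(hospital_sample))

mean_age_population = mean(p["age"] for p in population)
mean_age_hospital = mean(p["age"] for p in hospital_sample)
mean_age_random = mean(p["age"] for p in random_sample)

print(f"population mean age:      {mean_age_population:.1f}")
print(f"hospital-only sample:     {mean_age_hospital:.1f}")
print(f"unstratified random draw: {mean_age_random:.1f}")
```

Under these assumptions, the hospital-only sample skews noticeably older than the population it came from, while the random draw of the same size stays close to the population mean. A model fitted only to the hospital sample would inherit that skew.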
Conversely, random samples (whether stratified or unstratified) are selected at random within specified geographic or geological limits, and either account for or disregard these additional dimensions. For instance, when gathering geographical data, the researcher can choose the geographical boundaries relevant to the study (e.g., neighborhood boundaries or city limits) and may then subdivide, or stratify, the sample to ensure that the sample population reflects the target population. Often the researcher does not wish to exclude observations but rather to minimize the potential for error due to latent variables. In the case of Covid-19, recognizing that the disease affects older populations differently might require us to consider the mean age and density of the population in a specific geographic area when deciding the “spatial scale” of our model (Holmdahl, I. 2020).
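As a rough sketch of what stratification means in practice, the following Python draws a stratified random sample with proportional allocation, so that each stratum contributes to the sample in proportion to its share of the population. The helper name stratified_sample, the record fields region and age_band, and the synthetic records are all hypothetical and chosen only for illustration; a real study would define its strata from its own geographic boundaries and demographics.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, total, seed=0):
    """Draw roughly `total` records, allocating to each stratum by its size."""
    rng = random.Random(seed)

    # Group the records into strata using the caller-supplied key function.
    strata = defaultdict(list)
    for rec in records:
        strata[key(rec)].append(rec)

    sample = []
    for members in strata.values():
        # Proportional allocation, keeping at least one record per stratum.
        k = max(1, round(total * len(members) / len(records)))
        sample.extend(rng.sample(members, min(k, len(members))))
    return sample


# Synthetic records; the regions and age bands are invented for this example.
rng = random.Random(42)
records = [
    {
        "region": rng.choice(["north", "south", "east"]),
        "age_band": rng.choice(["0-39", "40-64", "65+"]),
    }
    for _ in range(9_000)
]

sample = stratified_sample(
    records,
    key=lambda r: (r["region"], r["age_band"]),
    total=900,
)
print(len(sample), "records drawn across",
      len({(r["region"], r["age_band"]) for r in sample}), "strata")
```

Because each stratum is rounded separately, the final sample size can differ from the requested total by a few records. Proportional allocation is only one option: a researcher who is concerned about a small but important group (for example, the oldest age band in a sparsely populated area) might deliberately oversample that stratum and reweight later.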
Understanding Data Understanding
Within the context of CRISP-DM, “Data Understanding” is the process of performing quality assurance on the sample data to ensure accuracy and completeness. When working with geological and geographical data, there are countless “hidden” variables which may have a significant impact on the final predictive model (Li, 2019. 22). These latent variables are unknown to the researcher when gathering the initial samples but may become apparent later in the modeling process. The iterative nature of CRISP-DM allows data scientists to find additional data which might account for these variables. In the case of Covid-19, our original spatial models made incorrect assumptions about the rate of transmission of the virus because asymptomatic spreaders were missing from the data; later, widespread testing data which included asymptomatic carriers helped to drastically improve our models (Holmdahl, I. 2020).
Researchers should be aware of the challenges of sample selection when using GIS data, and of the accidental introduction of latent or hidden variables through sample bias, whether that bias comes from opportunistic sampling or from random sampling where stratification was needed but was not performed or was incomplete. As we have seen, the sample populations behind early Covid-19 models were biased because they included only confirmed, serious infections which required hospitalization; these disproportionately involved older individuals and excluded asymptomatic carriers. This potentially resulted in geospatial models which did not accurately predict the spread of the disease. By iteratively understanding the data and obtaining new data through geospatial stratification, such as by introducing widespread testing away from hospitals, GIS data scientists can mitigate the effect of these latent variables.
Holmdahl, I. (2020). Wrong but Useful — What Covid-19 Epidemiologic Models Can and Cannot Tell Us. New England Journal of Medicine. 383. 303-305. DOI: 10.1056/NEJMp2016822 https://www.nejm.org/doi/full/10.1056/NEJMp2016822
Li, J. (2019). A Critical Review of Spatial Predictive Modeling Process in Environmental Sciences with Reproducible Examples in R. Applied Sciences. 9. 2048. DOI: 10.3390/app9102048 https://www.mdpi.com/2076-3417/9/10/2048