Don’t let sleeping Musk Oxen lie - Detecting Bias in Open Science Datasets

Much, if not most, of contemporary large-scale ecological research relies on data sourced from open science data repositories. One particularly notable example is the Global Biodiversity Information Facility (GBIF), which makes available more than 2.5 billion records of observations of animals, plants, and fungi across the globe.

Despite the ubiquity of GBIF-mediated data in ecological research, most data sourced through GBIF stems not from scientific studies conducted with homogeneous sampling effort, but from independent, opportunistic citizen science observations. Consequently, our understanding of biodiversity on Earth is shaped by three categories of factors. The first concerns observers: (1) “Where and when do people go?” and (2) “What kind of organisms are people most likely to observe and want to share their observations of?”. The second concerns the organisms themselves: properties such as size, colour, and movement speed also affect observation likelihood. The third is environmental conditions, such as weather, land cover, terrain properties, and many more.

These drivers of the likelihood that an organism’s presence is entered into our open science data sets lead to biases in spatial, temporal, and taxonomic coverage. This issue has long been recognised by the macroecological community, but not enough has been done to understand and ultimately overcome these biases. With this M.Sc. project, we want to take the first step towards overcoming these biases by identifying and quantifying them.

To quantify spatial and temporal biases in observational records sourced through GBIF, this project will focus initially on Musk Oxen (Ovibos moschatus) in the greater Dovrefjell National Park region. This choice of target species is flexible: we invite prospective students to select other species of interest and to continuously evaluate, over the course of the project, how well the chosen focus species fits the purpose of this study. A quick glance at the recorded Musk Ox observations within Dovrefjell National Park immediately reveals a marked spatial bias, with observations clustered along the western side of the E6 (see figure below).

Map of purple dots representing musk oxen

This pattern is most likely a skewed representation of the real distribution of Musk Oxen in the area. It is probably strongly affected by (1) hiking trails that lead west but not east of the E6 in this region, resulting in a greater presence of observers west of the road, and (2) low-growing vegetation to the west and forested areas to the east, resulting in greater visibility of these charismatic animals west of the E6. Within this project, we want to develop algorithmic frameworks to detect such biases and quantify their strength. In doing so, we want to explore the effects of visiting rate, ground cover, and other important drivers of spatial observation biases.
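As a first intuition for what such a detection step could look like, the following minimal Python sketch compares how observations are split between zones with how the study area is split between them. It is an illustration under stated assumptions, not the project's method: the function name `zone_bias`, the zone labels, and all numbers are hypothetical, and the null model (observations proportional to area) ignores any real habitat preference of the animals.

```python
def zone_bias(observed, area_share):
    """Compare observation counts per zone with an area-proportional expectation.

    observed   -- dict mapping zone name to number of observations (hypothetical input)
    area_share -- dict mapping zone name to its fraction of the study area (must sum to 1)

    Returns (ratios, chi2): observed/expected ratio per zone and the Pearson
    chi-square statistic against the area-proportional null model.
    """
    if abs(sum(area_share.values()) - 1.0) > 1e-9:
        raise ValueError("area shares must sum to 1")
    if any(share <= 0 for share in area_share.values()):
        raise ValueError("every zone needs a positive area share")

    total = sum(observed.get(zone, 0) for zone in area_share)
    if total == 0:
        raise ValueError("no observations in the given zones")

    ratios = {}
    chi2 = 0.0
    for zone, share in area_share.items():
        expected = total * share          # count expected if effort were even
        count = observed.get(zone, 0)
        ratios[zone] = count / expected   # >1: over-represented, <1: under-represented
        chi2 += (count - expected) ** 2 / expected
    return ratios, chi2


# Made-up numbers for illustration only; these are NOT real GBIF counts.
example_counts = {"west_of_E6": 90, "east_of_E6": 10}
example_area = {"west_of_E6": 0.5, "east_of_E6": 0.5}
example_ratios, example_chi2 = zone_bias(example_counts, example_area)
```

A ratio well above 1 for a zone flags over-representation relative to its area. On its own this cannot separate observer effort from true habitat use; doing so would require covariates such as visiting rate and ground cover.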

Complementing this framework for the detection of spatial biases, the project will subsequently adapt the spatial framework to also quantify the strength of temporal biases. More Musk Oxen are clearly observed in summer than in winter (see below), but how strong is this effect, and can it be linked to weather patterns at the time of observation? Using state-of-the-art weather data pipelines, this M.Sc. project will shed light on these causal relationships.

Graph of observations per month
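One simple way to put a number on the strength of such a seasonal pattern is the mean resultant length from circular statistics: each month is placed on a circle and the observation counts are treated as weights. The Python sketch below is a hedged illustration only; the function names `monthly_counts` and `seasonality_strength` are hypothetical, the approach treats all months as equally long, and it does not correct for observer effort or weather.

```python
import math
from collections import Counter
from datetime import date


def monthly_counts(dates):
    """Return a list of 12 observation counts (January..December) from date objects."""
    counts = Counter(d.month for d in dates)
    return [counts.get(month, 0) for month in range(1, 13)]


def seasonality_strength(counts):
    """Mean resultant length of monthly counts placed on a circle.

    Returns a value between 0 (observations spread evenly over the year)
    and 1 (all observations fall in a single month).
    """
    if len(counts) != 12:
        raise ValueError("expected 12 monthly counts")
    total = sum(counts)
    if total == 0:
        raise ValueError("no observations")
    # Mid-month angle for each of the 12 months, in radians.
    angles = [2 * math.pi * (m + 0.5) / 12 for m in range(12)]
    cos_sum = sum(n * math.cos(a) for n, a in zip(counts, angles))
    sin_sum = sum(n * math.sin(a) for n, a in zip(counts, angles))
    return math.hypot(cos_sum, sin_sum) / total


# Made-up dates for illustration only; these are NOT real observation records.
example_dates = [date(2023, 7, 1), date(2023, 7, 15), date(2023, 1, 3)]
example_strength = seasonality_strength(monthly_counts(example_dates))
```

A value near 0 indicates little seasonality and a value near 1 a strong concentration in one part of the year. Linking that strength to weather would require joining each observation with weather data for its time and place.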

Once this tool for quantifying spatial and temporal biases in open science data sets is in place, the project will broaden the focus of the computational analysis by adapting the newly developed bias-detection framework to investigate taxonomic bias of observations in the area. With this final step of the analysis, we will be able to compare how the detected biases and their drivers change our understanding of charismatic species (like the Musk Ox) and less charismatic species (e.g., lichen species).
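As a rough starting point for the taxonomic comparison, one could measure how unevenly records are spread across taxa, for example with a Gini coefficient of record counts. The Python sketch below is an assumption-laden illustration rather than the project's framework: the function name `gini` and the example counts are hypothetical, and an uneven spread may partly reflect true differences in abundance rather than observer preference.

```python
def gini(record_counts):
    """Gini coefficient of record counts across taxa.

    Returns 0 when every taxon has the same number of records and approaches
    (n - 1) / n when a single taxon out of n holds all records.
    """
    values = sorted(record_counts)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        raise ValueError("need at least one taxon with at least one record")
    # Rank-weighted sum over the ascending counts (ranks start at 1).
    weighted = sum(rank * value for rank, value in enumerate(values, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Made-up counts for illustration only; these are NOT real GBIF record counts.
example_records_per_taxon = [0, 0, 0, 10]
example_gini = gini(example_records_per_taxon)
```

Comparing this value between taxonomic groups, or before and after accounting for spatial and temporal drivers, is one possible way to express how strongly records concentrate on a few charismatic species.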

This work will form a valuable and timely contribution to the ecological research landscape and open up exciting opportunities for continued academic growth.

What will you learn?

You will acquire a strong computational mindset and considerable expertise in programming and statistical analyses of ecologically relevant data. You will also gain a thorough understanding of frequently used biodiversity data repositories.

What do we offer?

During this project with us, your place of work will be at the Natural History Museum at the University of Oslo – the Norwegian representative of the GBIF network. Here, you will find a picturesque working environment with a warm and welcoming set of colleagues as well as easy access to considerable high-performance computing resources required for your work.

What do we expect from you?

Our ideal candidate is computationally minded and motivated to explore ecological research through statistical applications and method development. We expect you to be reasonably familiar with and proficient in handling large (preferably spatially explicit) datasets and statistical analyses within R. Ideally, you are willing to start thinking about this project as early as your second semester and would be interested in aligning your elective courses with what is required to be computationally literate enough for this work.

Outcomes:

The work will be summarised in a publication aimed at the highly read and cited journal Methods in Ecology and Evolution. Publishing your work will make you a strong candidate for a continued journey within academia and set you up particularly well for a PhD. The strong focus on computational work will teach you valuable transferable skills which will make you desirable for the job market within and outside the academic sector.

Supervision:

You will be supervised jointly by Dr. Erik Kusch (Research Group Machine Readable Nature) and Prof. Micah Dunthorn (Research Group Evolution, eDNA, Genomics and Ethnobotany) at the Natural History Museum of the University of Oslo.

For further inquiries, please contact Erik Kusch at erik.kusch@nhm.uio.no.

Published 26 Feb. 2024 15:28 - Last modified 19 March 2024 10:46

Scope (credits)

60