Introduction

At the Research Institutes of Sweden (RISE), we cooperate with other organizations and companies, in both the public and private sectors, on research and innovation in Natural Language Processing (NLP). A major challenge that we often encounter is a lack of readiness with respect to data. Even if the research problem is sufficiently well defined, and the business value of the proposed solution is well described, it is often not clear what type of data is required, whether it is available, or whether it exists at all. We find that there is often not even a framework available for discussing issues related to data. The purpose of this document is to outline and highlight issues related to data accessibility, validity, and utility that may arise in such situations. We hope that this document may serve as a guide for working practically with data in the context of applied NLP.

The contents of this document have been published in the form of two pre-prints:

Scope

This document is concerned exclusively with data readiness in the context of NLP. Other modalities such as images, video, or sensor data are not covered, but similar considerations apply in those cases.

Work on data readiness related to other forms of data includes that of Nazabal et al. (2020), who address data wrangling issues from a general standpoint using a set of case studies, as well as the work by van Ooijen (2019) and by Harvey and Glocker (2019), both of which deal with data quality in medical imaging. We have not found any work that focuses specifically on data readiness in the context of NLP.

How to use this document

The intention is for this document to provide insights into the types of challenges one might encounter, with respect to data, when embarking on a project involving NLP. The document focuses on asking the right questions rather than providing an explicit guide that covers all possible challenges in a project: such a guide would inevitably vary with the specific task at hand. The following five sections make up the document:

  • Data Readiness Levels introduces the notion of data readiness.
  • Project phases outlines the typical structure of a research or innovation project, and puts data readiness into context within that structure.
  • Questions for guidance is the central part of the NLP data readiness document. Use the questions to understand the data readiness of your own project.
  • Resources is a collection of links to resources related to this document.
  • How to contribute contains instructions for how to help us improve the document. Please reach out with any questions or suggestions!